PREFACE


The book is an outgrowth of lectures on sample surveys which
the author has delivered since 1945 at the Indian Council of 
Agricultural Research, subsequently at the International School 
on Censuses and Statistics in 1949-50 held at Delhi under the 
auspices of the Food and Agriculture Organization of the United 
Nations, at the two summer sessions conducted by the Indian 
Society of Agricultural Statistics in 1950 and 1951, and finally at 
the Statistical Laboratory of the Iowa State College, Ames, 
Iowa, U.S.A., in the spring of 1952. 


There was no plan at first of publishing a book and the notes 
prepared for the lectures were mimeographed for the use of the 
students, but as the scope of the course was gradually enlarged, 
suggestions were received that the lectures should be published 
in the form of a text for teaching at colleges and universities. 
It was felt that this publication would fulfil a real need for 
a systematic treatment of the sampling theory in relation to 
large-scale surveys. About the same time the Conference of the 
Food and Agriculture Organization of the United Nations recom- 
mended at its Sixth Session that a book be prepared incorporating 
a comprehensive treatment of the sampling theory of surveys and 
its applications so as to be of direct assistance to the sampling 
experts working in various countries in their efforts to introduce 
the sampling method for improvement of agricultural statistics. 
The mimeographed notes were accordingly reorganized and ampli- 
fied to include illustrative material on agricultural surveys from 
different countries; the publication of the present book is the 
result. 


In keeping with its objectives the book is primarily designed 
to serve the needs of a text for teaching an advanced course in
sampling theory of surveys and of a reference book for statisticians 
entrusted with the planning of surveys for collecting statistics. 
Every attempt has been made to present all the modern develop- 
ments of sampling theory which are of importance in survey 
work. Some of the results have already appeared in the papers 
published in the Journal of the Indian Society of Agricultural
Statistics. These might appear new to many readers since they 
might not have seen this Journal. The book also gives a number 
of results which are being published for the first time. Among 
these should be mentioned particularly the algebraic treatment of 
non-sampling errors whose importance relative to sampling errors 
has not been sufficiently stressed in the literature on the subject.


In order that the theory presented in the book should be of 
direct assistance in practice, it is illustrated with examples of actual 
surveys so as to serve the special needs of under-developed 
countries in the field of sampling, as recommended by FAO. 
These examples are oriented largely around agricultural statistics, 
in keeping with the author's experience in this field and FAO's 
interest, and relate to surveys for the estimation of crop acreage, 
yield, incidence of insect pests on crops, livestock numbers and 
their products, other farm facts and fisheries production. The 
author is conscious that these examples by themselves will not 
meet the entire needs of sampling workers, particularly those from
the economically less developed countries where the resources 
available for planning surveys are meagre and a large majority 
of the people are illiterate, do not appreciate the purpose of the 
inquiry, nor know the correct answers to the questions put to 
them. The contribution to the total error in the result arising 
from this latter factor is very large in these countries and 
emphasizes the great value of developing satisfactory measurement 
techniques before attempting nation-wide surveys. The relevant 
theory bearing on this question has been discussed in Chapter X. 
What is further needed is a simple exposition of a few typical 
surveys. Such a book is nearing completion and it is hoped 
to make it available soon. 


The need for keeping the volume within reasonable size has 
prevented any elaborate supporting description of the theory and 
examples given in the book. The author's aim all along has been 
to present the theory in as straightforward a manner as possible. 
The only pre-requisites are college algebra, elements of calculus 
and principal statistical methods such as those covered in 
Statistical Methods for Agricultural Workers, by V. G. Panse 
and the author. Even so the author is aware that at places the 
treatment has become too terse. Such sections have been marked
with an asterisk to indicate that the portion can be left over from 
the first reading without losing the continuity of the text. 


The author has received considerable assistance in preparing 
the book from his former colleagues in India. First of all he 
gratefully acknowledges the encouragement and help which he 
received from his former Chief, Mr. P. M. Kharegat, then 
Secretary to the Ministry of Agriculture, Government of India, 
to whose farsightedness are principally due the advances which 
India has made in the field of sampling. He is indebted to
Messrs. V. G. Panse, G. R. Seth, K. Kishen, R. D. Narain, 
O. P. Aggarwal and B. V. Sukhatme who read parts of the 
manuscript and made numerous suggestions to improve the 
presentation; to Messrs. K. S. Krishnan, S. H. Ayer and
K. V. R. Sastry who worked through the examples; and to 
Mrs. Evans of the Statistics Branch of FAO who checked through 
them and also helped in the preparation of the index to the 
book; to Dr. P. N. Saxena who shouldered a particularly heavy 
responsibility of reading critically the manuscript and the proofs; 
and to Suzanne Brunelle and Mary Nakano for their typing and 
secretarial help. The author would also like to express his thanks to
Dr. T. A. Bancroft, Dr. D. J. Thompson and other members of 
the staff of the Statistical Laboratory, Iowa State College, with 
whom he had the opportunity to work as visiting professor during 
the spring term of 1952 and to Marshall Townsend of the Iowa 
State College Press for their interest and encouragement in the 
publication of the book. Last but not least the author is 
indebted to Mr. Norris E. Dodd, Director-General of the FAO, 
who invited the author to come to FAO to head the Statistics 
Branch, which gave him the opportunity to appreciate more fully 
the urgent need for promoting sampling for improving agricultural 
statistics in under-developed countries; and to Dr. A. H. Boerma, 
Director of Economics Division of the FAO, for his constant
encouragement and advice.


September 1953. PANDURANG V. SUKHATME. 


SPECIAL SYMBOLS


Coefficient of variation 

Covariance 

Cost function 

Coefficients of cost functions 

Mean square error 

Standard error 

First approximation to variance 

Second approximation to variance 

Variance of a stratified sample 

Variance of a stratified sample, proportional allocation

Variance of a stratified sample, Neyman allocation

Variance of a random sample 

Variance of an unstratified sample 

Variance of a systematic sample 


CHAPTER I 
BASIC IDEAS IN SAMPLING 


1.1 Sampling Method


A sampling method is a method of selecting a fraction of the 
population in a way that the selected sample represents the 
population. Every one of us has had occasion to use it. It is
almost instinctive for a person to examine a few articles, preferably 
from different parts of a lot, before he or she decides to buy it. 
No particular attention is, however, paid to the method of 
choosing articles for examination. A wholesale buyer, on the 
other hand, has to be careful in selecting articles for examination 
as it is important for him to ensure that the sample of articles 
selected for examination is typical of the manufactured product 
lest he should incur in the long run a heavy loss through wrong 
decision. Similarly, in obtaining information on the average 
yield of a crop by sampling, it is not sufficient to ensure that the 
fields to be included in the sample come from different parts of the 
country, for, the sample may well contain a very much larger 
(or smaller) proportion of fields of a particular category like 
irrigated, manured or growing improved variety, than is present 
in the population. If any category is consistently favoured at the 
expense of the other, the sample will cease to represent the whole. 
Even if the sample is selected in such a way that the proportions
in the sample under different categories agree with those in the 
population, the sample may not still represent the population. 
A sampling method, if it is to provide a sample representative 
of the population, must be such that all characteristics of the 
population, including that of variability among units of the 
population, are reflected in the sample as closely as the size of 
the sample will permit, so that reliable estimates of the population 
characters can be formed from the sample. 


1.2 Standard Error 


Whatever be the method of selection, a sample estimate will
inevitably differ from the one that would be obtained from
enumerating the complete population with equal care. This
difference between the sample estimate and the population value
is called the sampling error. The larger the sample, the smaller 
will obviously be the sampling error on the average and the 
greater will be our confidence in the results. A sampling method, 
if it is to be serviceable, must provide some idea of the sampling 
error in the estimate on an average. We must, for instance, be 
able to form a precise idea of the extent to which we are likely 
to be in error on an average in estimating the yield of a crop 
from the sample. Several measures are available for the purpose. 
One such measure of the average magnitude of the sampling error 
is called the standard error of the estimate and provides a measure 
of the reliability, as it were, of the sample estimate. It is the 
magnitude of the standard error which will determine whether 
a sample estimate is useful for a given purpose. This, in turn, 
will depend upon the breakdown expected of the results. If, for
example, estimates of crop acreages are required for every village 
of the State and for all crops, major or minor, there will be little 
point in using the sampling method. 


1.3 Principle of Choosing among Alternative Sampling Methods 


Practical considerations have also to be kept in view in the 
use of a sampling method. Crop-cutting surveys for estimating 
the average yield of a crop provide a good example for illustration. 
It is not enough in a crop-cutting survey to select a sample of 
fields representative of the total number under the crop and 
sample-harvest the selected fields at the time of the visit of the
enumerator; it is also necessary to ensure that the selected fields 
are reached on the dates the cultivators would harvest them. 
Only then would the distribution of sample-harvesting over time 
correspond to the distribution over time of actual harvesting. 
The procedure of sample-harvesting should also correspond, in so 
far as practicable, to the one adopted by the cultivator so that 
what is observed would correspond to what is gathered by the 
cultivator, which is what one wants to estimate. Further, a 
sampling method, if it is to be acceptable in practice, must be 
simple, fit into the administrative background and local condi- 
tions and ensure the most effective use of the resources available 
to the sampler. The guiding principle in the choice of a sampling 
method is, in fact, the principle of securing the desired result 
with the reliability required at minimum cost or with the maximum
reliability at a given cost, making the most effective use of the 
resources available. 


1.4 Probability Sampling 


To fulfil the above requirements, it is necessary that the method 
of sampling be objective, based on laws of chance. The method 
is called the method of probability sampling. In this method, 
the sample is obtained in successive draws of a unit each with 
a known probability of selection assigned to each unit of the 
population at the first draw. At any subsequent draw, the prob- 
ability of selecting any unit from among the available units at 
that draw may be either proportional to the probability of selecting 
it at the first draw or completely independent of it. 


The successive draws of a probability sample may be made
with or without replacing the units selected in the previous 
draws. The former is called the procedure of sampling with 
replacement, the latter without replacement. 
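
The two procedures may be sketched in Python as follows. This is an illustration only: the population of 10 serially numbered units, the sample size of 4 and the seed are assumptions made for the example, and equal probabilities are used at every draw for simplicity.

```python
import random

# Hypothetical population of N = 10 serially numbered units (illustrative only).
population = list(range(1, 11))
rng = random.Random(1953)  # fixed seed so that the run can be repeated

# Sampling with replacement: a unit once drawn is returned to the
# population and may therefore be drawn again at a later draw.
with_replacement = rng.choices(population, k=4)

# Sampling without replacement: a unit once drawn is not available at
# subsequent draws, so that all the selected units are distinct.
without_replacement = rng.sample(population, k=4)

print(with_replacement)
print(without_replacement)
```

A sample drawn with replacement may contain the same unit more than once; one drawn without replacement cannot.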

The application of the method presumes that the population 
can be subdivided into distinct and identifiable units called sampl- 
ing units. The units may be natural units, such as individuals in 
a human population or fields in a crop-estimating survey or natural
aggregates of such units like families or villages; or they may be
artificial units, such as a single plant, a row of plants or a plot
of size, say, 10 × 10 square feet in sampling a field of wheat. In
general, for a given proportion of the population to be sampled,
the smaller the sampling unit the more accurate will be the 
sample estimate. The application of the method naturally pre- 
supposes the availability of a list of all the sampling units in the 
population. This list is called the frame and provides the basis 
for the actual selection of the sample. An example of a frame is 
furnished by a list of farms, where one exists, or suitable area- 
segments like the village in India or the section in the United 
States. The section forms the sampling unit and provides the 
means for selecting a sample of farms. 


1.5 Simple Random Sampling


The simplest of the methods of probability sampling which 
provides estimates of the population characters and a measure of 
the reliability of the estimates made is the method of simple
random sampling. In this method, usually called for brevity the 
method of random sampling, an equal probability of selection is 
assigned to each unit of the population at the first draw. The 
method implies an equal probability of selecting any unit from 
among the available units at subsequent draws. Thus, if the 
number of units in the population is N, the probability of selecting 
any unit at the first draw will be 1/N, the probability of selecting 
any unit from among the available units at the second draw is 
1/(N - 1), and so on.


An important property of simple random sampling is that the 
probability of selecting a specified unit of the population at any 
given draw is equal to the probability of selecting it at the first 
draw. For, let 


n denote the number of units to form the sample. 


The probability that the specified unit is selected at the r-th draw 
is clearly the product of (1) the probability of the event that it 
is not selected in any of the previous r - 1 draws; and (2) the
probability of the event that it is selected at the r-th draw. The 
probability that it is not selected at the first draw is, by 
definition, (N - 1)/N; that it is not selected at the second draw
(N - 2)/(N - 1), and so on. The probability of event (1) is,
therefore,

$$\frac{N-1}{N} \cdot \frac{N-2}{N-1} \cdots \frac{N-r+1}{N-r+2}.$$

The probability of the event (2) is clearly 1/(N - r + 1). The
product of the two is, therefore,

$$\frac{N-1}{N} \cdot \frac{N-2}{N-1} \cdots \frac{N-r+1}{N-r+2} \cdot \frac{1}{N-r+1} = \frac{1}{N},$$



which is the probability of drawing the specified unit at the first 
draw. 

Since the specified unit may be included in the sample at any
of the n draws, it also follows that the probability that it is
included in the sample is the sum of the probabilities that it is
selected in the first draw, second draw, ..., n-th draw, and is, therefore,
equal to n/N. Since this result is independent of the specified
unit, it follows that every one of the units in the population has
the same chance of being included in the sample under the 
procedure of simple random sampling. This, in fact, has some- 
times been used as the definition of simple random sampling. 
However, this definition does not completely specify the procedure 
of simple random sampling, for, as will be clear in Chapter IX, 
there can be other procedures of sampling which do not give the 
same chance of selection at the first draw to each unit of the 
population and yet the probability that any specified unit is
included in the sample is n/N. 
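
The two results of this section, namely, that the probability of selecting a specified unit at any given draw is 1/N and that the probability of its inclusion in the sample is n/N, can be checked numerically. The following is a minimal sketch using exact fractions; the function names and the values N = 10 and n = 4 are assumptions chosen for illustration.

```python
from fractions import Fraction

def prob_selected_at_draw(N, r):
    """Probability that a specified unit is selected at the r-th draw
    when units are drawn without replacement with equal probabilities."""
    p = Fraction(1)
    # Event (1): the unit is not selected in any of the previous r - 1 draws.
    for k in range(r - 1):
        p *= Fraction(N - k - 1, N - k)
    # Event (2): it is selected at the r-th draw from the N - r + 1 units left.
    return p * Fraction(1, N - r + 1)

def prob_included(N, n):
    """Probability that a specified unit is included in a sample of n units."""
    return sum(prob_selected_at_draw(N, r) for r in range(1, n + 1))

N, n = 10, 4  # hypothetical population and sample sizes
for r in range(1, n + 1):
    print(r, prob_selected_at_draw(N, r))   # 1/10 at every draw
print(prob_included(N, n))                   # 2/5, i.e. n/N
```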


The method of simple random sampling is also equivalent to 
giving an equal probability to each possible cluster of n units
to form the sample of the population. The number of possible
clusters of n units is

$$\binom{N}{n}.$$

Random sampling implies that every one of these possible clusters
will have an equal probability, namely,

$$\frac{1}{\binom{N}{n}},$$

of being selected as the sample. Thus, if the population consists
of 4 farms serially numbered 1, 2, 3 and 4, having 2, 3, 4 and
7 acres under corn respectively, then the possible clusters of 2 farms
from this population will be the following six:


Serial Number    Serial Numbers of Units    Values of the Units
of Cluster       in the Cluster             in the Cluster
     1                 1, 2                       2, 3
     2                 1, 3                       2, 4
     3                 1, 4                       2, 7
     4                 2, 3                       3, 4
     5                 2, 4                       3, 7
     6                 3, 4                       4, 7
Random sampling implies that every one of these 6 clusters of
2 each will have a chance of 1/6 of being selected as the sample
for our study. It is easy to establish this result. 
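
The six clusters of the four-farm population can be enumerated mechanically, which also gives a concrete view of the sampling error of Section 1.2. The sketch below lists each cluster with its sample mean and its departure from the population mean. Taking the root-mean-square of these departures over all possible samples as the "average magnitude" of the sampling error is an assumption made here for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import sqrt

acres = {1: 2, 2: 3, 3: 4, 4: 7}   # farm serial number -> acres under corn
n = 2                              # number of farms in each cluster

clusters = list(combinations(sorted(acres), n))
chance = Fraction(1, len(clusters))   # equal probability for every cluster

population_mean = Fraction(sum(acres.values()), len(acres))

sample_means = []
for number, cluster in enumerate(clusters, start=1):
    values = [acres[unit] for unit in cluster]
    mean = Fraction(sum(values), n)
    sample_means.append(mean)
    # cluster number, units, values, sample mean, sampling error
    print(number, cluster, values, mean, mean - population_mean)

# Average of the sample means over all equally likely clusters.
mean_of_means = sum(sample_means) / len(sample_means)

# Root-mean-square sampling error over all possible clusters
# (an illustrative reading of the "average magnitude" of the error).
rms_error = sqrt(sum((m - population_mean) ** 2 for m in sample_means)
                 / len(sample_means))
print(chance, mean_of_means, rms_error)
```

The sample means average out to the population mean of 4 acres, although no single cluster need reproduce it exactly.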


The probability of selecting any one unit at the first draw is, 
by definition, 1/N. Having selected one, the probability of selecting 
any one of the remaining units at the second draw is clearly 
1/(N - 1), and so on. The probability of selecting any given n
units in succession in a specified order is thus

$$\frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1}.$$

Since the order in which the units are selected is immaterial, the
probability of any given n units to form the sample is thus
given by

$$\frac{n!}{N(N-1)\cdots(N-n+1)} = \frac{1}{\binom{N}{n}}.$$

Every one of the $\binom{N}{n}$ possible clusters of n each has thus an
equal probability of being selected under this method.
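
The argument just given can be verified with exact arithmetic. The following sketch multiplies the probabilities of the successive draws for one specified order and then allows for the n! possible orders; the function name is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def prob_given_cluster(N, n):
    """Probability that n given units form the sample under simple
    random sampling without replacement."""
    # Probability of drawing the n units in one specified order:
    # 1/N * 1/(N - 1) * ... * 1/(N - n + 1).
    in_order = Fraction(1)
    for k in range(n):
        in_order *= Fraction(1, N - k)
    # The order of selection is immaterial, and there are n! orders.
    return factorial(n) * in_order

print(prob_given_cluster(4, 2))   # 1/6, as for the four-farm population
print(prob_given_cluster(4, 2) == Fraction(1, comb(4, 2)))
```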


The word “random” refers to the method of selecting a sample 
rather than to the particular sample selected. Any possible sample 
can be a simple random sample, however unrepresentative it may 
appear, so long as it is obtained by following the rule of giving 
an equal chance to every one of the possible samples. Thus, a 
person may draw a sample of 13 cards from a well-shuffled pack
and still find that all are of the same suit. The sample is obviously 
unrepresentative of different colours, but nevertheless must be 
considered to be a random sample by virtue of the method of 
selection employed. 


1.6 Procedure of Selecting a Random Sample 


A practical procedure of selecting a random sample is by using
a table of random numbers, such as those published by Tippett 
(1927), a page from which is reproduced in the Appendix to this 
chapter. The procedure takes the form of (a) identifying N units 
in the population with the numbers 1 to N, or what is the same 
thing, preparing a list of units in the population and serially 
numbering them, (b) selecting different numbers from the table
of random numbers, and (c) taking for the sample the n units
whose numbers correspond to those drawn from the table of 
random numbers. The following examples will illustrate the 
procedure: 


Example 1.1 

Select a sample of 34 villages from a list of 338 villages. 

Using the three-figure numbers given in columns 1 to 3, 4 to 6, 
etc., of the table given in the Appendix and rejecting numbers 
greater than 338 (and also the number 000), we have for the 
sample: 

125, 326, 12, 237, 35, 251, 165, 191, 198,
33, 161, 209, 51, 52, 331, 218, 337, 263,
223, 241, 277, 42, 14, 303, 40, 99, 102,
178, 197, 321, 335, 155, 165, 81.

The procedure involves the rejection of a large number of
random numbers, nearly two-thirds. A device commonly employed
to avoid the rejection of such large numbers is to divide a random 
number by 338 and take the remainder as equivalent to the 
corresponding serial number between 1 and 337, the remainder zero
corresponding to 338. It is, however, necessary to reject random 
numbers 677 to 999 and also 000 in adopting this procedure as 
otherwise villages with serial numbers 1 to 323 will get a larger 
chance of selection equal to 3/999 while those with serial numbers 
324 to 338 will get a chance equal to 2/999. If we use this
procedure and also the same three-figure random numbers as given 
in columns 1 to 3, 4 to 6, etc., we will obtain the sample of 
villages with serial numbers given below: 
125, 206, 326, 193, 12, 237, 95, 251,
325, 338, 114, 231, 78, 112, 126, 330,
312, 165, 131, 198, 33, 161, 209, 51,
52, 331, 218, 337, 238, 323, 263, 90,
11, 223.
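
The remainder device of Example 1.1, and the reason for rejecting the numbers 677 to 999, may be sketched as follows. The function name `to_serial` is an assumption made for illustration. Rather than reading an actual page of random numbers, the sketch counts, over all three-figure numbers, how many lead to each village.

```python
from collections import Counter

N = 338                     # number of villages in the list
LIMIT = (999 // N) * N      # 676, the largest three-figure number retained

def to_serial(random_number):
    """Convert a three-figure random number into a village serial number,
    or return None when the number is to be rejected."""
    if random_number == 0 or random_number > LIMIT:
        return None
    remainder = random_number % N
    # The remainder zero corresponds to the serial number 338.
    return N if remainder == 0 else remainder

# With rejection of 000 and 677 to 999, every village is reached by
# exactly two of the retained random numbers.
with_rejection = Counter(
    to_serial(x) for x in range(1000) if to_serial(x) is not None
)

# If 677 to 999 were not rejected, villages 1 to 323 would each be
# reached by three random numbers and villages 324 to 338 by only two.
without_rejection = Counter(
    N if x % N == 0 else x % N for x in range(1, 1000)
)

print(set(with_rejection.values()))
print(without_rejection[323], without_rejection[324])
```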


Example 1.2 
The following procedure has been used for selecting a sample 
of fields for crop-cutting experiments on paddy in the surveys 
carried out by the Indian Council of Agricultural Research
(1951). 

“Against the name of each selected village are shown three 
random numbers smaller than the highest survey number* in the 
village. Select the survey numbers corresponding to given random 
numbers for experiments. If the selected survey number does not 
grow paddy, select the next higher paddy-growing survey number 
in its place." 

Examine whether the above method will provide an equal 
chance of inclusion in the sample to all the paddy-growing survey 
numbers in the village, given the following: 


1. Name of village                  ..  Payagpur
2. Total number of survey numbers   ..  290
3. Random numbers                   ..  18, 189, 239
4. Paddy-growing survey numbers     ..  49 to 88 and 189 to 290

Clearly, according to instructions, the survey numbers to be 
selected for crop-cutting experiments will be 49, 189 and 239. In 
selecting the first paddy-growing survey number for experiment, we 
thus give the survey number 49 a chance of 49/290 of being included 
in the sample, the survey number 189 a chance of 101/290, while 
to the remaining paddy-growing survey numbers a chance of 1/290 
each. If paddy is grown in patches covering several survey numbers, 
as in the present example, the method will result in giving a larger 
chance to the border fields of being included in the sample. 
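
The chances quoted in Example 1.2 can be reproduced by applying the instruction to every possible random number in turn. The sketch below assumes that each of the numbers 1 to 290 is equally likely to be shown against the village; the function name is likewise an assumption.

```python
from collections import Counter
from fractions import Fraction

TOTAL = 290   # total number of survey numbers in the village
# Paddy-growing survey numbers: 49 to 88 and 189 to 290, in ascending order.
paddy = list(range(49, 89)) + list(range(189, 291))

def selected_survey_number(random_number):
    """Apply the instruction: take the survey number itself if it grows
    paddy, otherwise the next higher paddy-growing survey number."""
    for survey_number in paddy:
        if survey_number >= random_number:
            return survey_number
    return None   # not reached here, since number 290 itself grows paddy

# Count how many of the 290 equally likely random numbers lead to each field.
counts = Counter(selected_survey_number(r) for r in range(1, TOTAL + 1))
chance = {s: Fraction(counts[s], TOTAL) for s in paddy}

print([selected_survey_number(r) for r in (18, 189, 239)])
print(chance[49], chance[189], chance[50])
```

The fields at the lower border of each patch (survey numbers 49 and 189) absorb the chances of all the non-paddy numbers preceding them.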


Example 1.3


Nine villages in a certain administrative area contain 793, 170,
970, 657, 1721, 1603, 864, 383 and 826 fields respectively. Make a 
random selection of 6 fields, using the method of random sampling. 

The total number of fields in all the 9 villages is 7987. The 
first step in the selection of a random sample of fields is to 
have these serially numbered from 1 to 7987, by taking successive 
cumulative totals: 

793, 963, 1933, 2590, 4311, 5914, 6778, 7161, 7987, 
the 793 fields in village 1 being given the serial numbers 1 
through 793, the 170 fields in village 2 being given the serial 


* Each field or separate piece of land in a village bears an official number which
is termed the ‘survey number’. 
numbers 794 through 963, and so on. A reference to the four-digit
random numbers in columns 9 to 12 will then give the following 
sample of fields with serial numbers 7358, 922, 4112, 3596, 633 
and 3999. The corresponding fields will be No. 197 from village 9, 
No. 129 from village 2, No. 1522 and No. 1006 from village 5, 
No. 633 from village 1 and No. 1409 from village 5. 
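
The conversion of a serial number into a village and a field within it, by reference to the cumulative totals, may be sketched as follows. The function name `locate` is an assumption made for illustration; the village sizes and the six serial numbers are those of Example 1.3.

```python
from bisect import bisect_left
from itertools import accumulate

fields_per_village = [793, 170, 970, 657, 1721, 1603, 864, 383, 826]
cumulative = list(accumulate(fields_per_village))   # 793, 963, ..., 7987

def locate(serial):
    """Return (village, field number within the village) for a serial
    number between 1 and the total number of fields."""
    # Index of the first cumulative total not less than the serial number.
    index = bisect_left(cumulative, serial)
    previous_total = cumulative[index - 1] if index > 0 else 0
    return index + 1, serial - previous_total

for serial in [7358, 922, 4112, 3596, 633, 3999]:
    print(serial, locate(serial))
```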


It will be noted that the selection has actually proceeded in 
two stages, selecting a village in the first instance with probability 
proportional to the number of fields in the village and choosing, 
on the basis of the random number already selected, a field in
the selected village, villages being sampled with replacement. 
It must, however, be remembered that this equivalence between 
the one- and two-stage sampling holds good only when the 
number of second-stage units to be selected from a first-stage
unit of sampling is limited to one. 


Example 1.4 


The following method is laid down for locating and marking 
a random plot of area 33' × 33' in a field selected for crop-cutting
experiments in India (I.C.A.R., 1951).

“Stand facing North with the field in front of you and to your 
right. Measure the length and the breadth of the field in feet and 
deduct 33’ from each. Select a pair of random numbers less than
or equal to the remainders so obtained to locate the corner of the 
plot. Fix a peg at this corner, tie a string to it and stretch it 
along the length of the field away from the South-West corner 
of the field. Measure 33’ along it by means of a tape and put 
the cross-staff at this point. Turn the string round the cross-staff 
and stretch it at right-angles away from the South-West corner 
of the field and measure 33’ along it. Proceed in this manner 
until you reach the starting point of the plot by checking the 
distance between the fourth and the first corner.” 


Examine whether the method will give an equal chance to all 

the unit areas in the field, it being given that the field is a 
rectangular field, measuring 120’ x 100’. 

The method implies a division of the field into (120 - 33) × (100 - 33)
plots, each measuring 33' × 33', which are not distinct but
overlapping. The fundamental requirement that the population should
be divisible into units which are distinct so that every unit area
of the population belongs to one and only one sampling unit is 
thus not fulfilled, with the consequence that the central areas get 
a relatively greater chance of selection than those near the 
border. 
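
The unequal chances in Example 1.4 can be made explicit by counting, for a given square foot of the field, the number of plot positions that cover it. The sketch below rests on assumptions not stated in the instructions quoted above: the field is treated as a grid of one-foot squares, and the corner of the plot is taken to fall at whole-foot offsets 0 to 87 along the length and 0 to 67 along the breadth, all equally likely. Whether the offset zero is admissible is not clear from the instructions; its inclusion alters the count of positions slightly but not the conclusion.

```python
from fractions import Fraction

LENGTH, BREADTH, SIDE = 120, 100, 33   # dimensions in feet

def positions_covering(coordinate, extent):
    """Number of admissible corner offsets along one side of the field for
    which the plot covers the one-foot interval starting at `coordinate`."""
    lowest = max(0, coordinate - SIDE + 1)
    highest = min(extent - SIDE, coordinate)
    return highest - lowest + 1

def chance(i, j):
    """Chance that the one-foot square at (i, j) falls within the plot."""
    total_positions = (LENGTH - SIDE + 1) * (BREADTH - SIDE + 1)
    covering = positions_covering(i, LENGTH) * positions_covering(j, BREADTH)
    return Fraction(covering, total_positions)

corner = chance(0, 0)      # a square at a corner of the field
centre = chance(60, 50)    # a square near the centre of the field
print(corner, centre, centre / corner)
```

Under these assumptions a square near the centre is covered by 33 × 33 plot positions and a corner square by only one.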


1.7 Non-Random Methods of Sampling 


Methods of sampling, which are not based on laws of chance 
but in which units of the population to be included in the sample 
are determined by the personal judgment of the enumerator, are 
called purposive or non-random methods. An example of this 
method, where personal judgment is introduced in the selection
of a sample, is provided by the old official method in India of 
selecting fields for sample-harvesting for determining the average 
yield of a crop. Under this method, the experimenter was required 
to select fields which, in his judgment, had an average crop. It 
was found that the experimenter tended to select fields which were 
poorer than the average when the season was good and better 
than the average when the season was bad. The result was a 
tendency to over-estimate yields in bad years and to under-estimate 
them in good years. The quota method of sampling, so exten- 
sively used in the United States of America in opinion surveys, 
is another example of this method. Here quotas are set up for
the different categories of the population to be included in the 
sample and the selection of units from each category is left to the 
personal discretion of the enumerator. The method is convenient
to use in practice. Its cost is also low relative to that of the
method of probability sampling. However, the sample does not 
provide any means of judging the reliability of the estimates based 
thereon. If we want to have unbiased estimates of the population 
character whose accuracy can be measured from the samples 
themselves, probability sampling alone should be used. 


1.8 Non-Sampling Errors 


The accuracy of a result is affected not only by sampling errors 
arising from chance variation in the selection of the sample but also 
by (a) lack of precision in reporting observations, (b) incomplete
or faulty canvassing of a designated random sample, and (c)
faulty methods of estimation. These errors, particularly those
under (a) and (b), are usually grouped under the heading "non-
sampling errors". Deming (1944, 1950) has listed the different
sources of errors and biases arising from (a). These, in his words, 
are principally due to arbitrariness in definition and variable 
performance of the man. An eye-estimate of the crop provides 
an example of this source of errors. Eye-estimate is a form of
measurement which cannot, in the very nature of things, give 
a unique result even when the same field is observed at different 
times by the same enumerator. The result will depend upon the 
personal judgment of the enumerator, no matter how well he is 
trained and consequently there will be variation from enumerator 
to enumerator observing the same field and in repeated observa- 
tions by the same enumerator. A character like damage to a crop 
in the field from rust will similarly involve a certain amount of 
arbitrariness in definition and, therefore, give variable response. 
Even with factual characters like the area under the crop in a field, 
there is found to be marked variation in performance of the same 
enumerator measuring the acreage at different times or of different 
enumerators measuring the same field. In an inspection carried 
out by the statistical staff to test the reliability of the area records 
maintained by the patwaris (village officials) in 61 villages selected 
at random in the Lucknow District (India), about 20% of the 
reports were found to be in disagreement. Part of the discre- 
pancies could, of course, be explained by carelessness or even 
dishonesty but most of them were due to differing descriptions 
of the same situation given by different agencies (Sukhatme and
Kishen, 1951). 


An example of faulty canvassing of a selected sample is cited
by Kiser (1934) who selected a random sample of households for
studying morbidity. The relative frequency distribution of the size
of households included in the sample and as revealed by the census
is given in Table 1.1, which shows that the sample is considerably
deficient in the frequency of households of size 2. Kiser attributes
the deficiency to the failure on the part of the enumerators to
re-visit missed households in which childless married women
working away from homes are likely to predominate.


TABLE 1.1

Relative Frequency Distribution of the Size
of the Households

Size of Household      Sample       Census
                     (per cent)   (per cent)
2                       19.4         26.8
3                       25.0         26.5
4                       23.5         21.9
5                       15.4         13.0
6                        8.1          5.9
7                        3.5          3.2
8                        1.9          1.4
9 and over               2.2          1.3


A similar bias, attributed to the poor execution in the field
of the selected sample, arose in a survey for estimating the yield
of wheat carried out in Uttar Pradesh (India) in 1943-44. The
harvesting had already commenced when the investigators went 
to conduct experiments in the selected fields, with the result that 
the fields harvested in the sample contained a larger proportion 
of irrigated fields than was present in the population. For, it is 
usually the unirrigated fields which mature earlier and which had,
therefore, been harvested by the time the investigators commenced
their experiments. The instruction to select the next available
field for experiment would have only resulted in increasing the
preponderance of irrigated fields in the sample. 


Faulty methods of estimation can be illustrated with reference 
to a plan for estimating the average yield of a crop per acre. 
If, for example, an equal chance is given to every field under the 
crop to be included in the sample and the average yield per acre of
larger fields is different from that of smaller ones, then the average 
yield per acre estimated from the simple arithmetic mean of the 
yields per acre of the fields in the sample may be markedly 
different from the average yield per acre of the total area under 
the crop. 
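
The point can be illustrated numerically. The sketch below uses three invented fields (the areas and produce figures are hypothetical, chosen only so that the larger field has the lower yield per acre) and compares the simple arithmetic mean of the per-acre yields with the yield per acre of the total area.

```python
# Hypothetical fields: (area in acres, total produce in lb).
# These figures are invented for illustration; they are not survey data.
fields = [(1.0, 1200.0), (2.0, 2200.0), (10.0, 8000.0)]


def mean_of_yields(fields):
    """Simple arithmetic mean of the per-acre yields of the fields."""
    return sum(produce / area for area, produce in fields) / len(fields)


def yield_of_total_area(fields):
    """Average yield per acre of the total area: total produce / total area."""
    total_area = sum(area for area, _ in fields)
    total_produce = sum(produce for _, produce in fields)
    return total_produce / total_area


unweighted = mean_of_yields(fields)     # what equal-chance sampling of fields estimates
weighted = yield_of_total_area(fields)  # the quantity actually wanted
```

With these figures the unweighted mean is about 1033 lb. per acre against about 877 lb. per acre for the total area, because the large low-yielding field counts only once in the former.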

The net effect of these discrepancies on the value of the estimate 
may not, however, always be serious, particularly in cases where 
errors occur in both directions and there is a reasonable chance
of their cancelling one another. Errors, particularly those due to 
variability in reporting, introduce an additional component of 
variability which goes to inflate the estimate of the sampling error. 
It is frequently found that these errors do not cancel out and 
that the net effect is a bias due to the tendency to uniformly 
report a higher (or lower) figure than the true unknown value. It 
is, therefore, important to control these errors as far as practicable. 
Methods of measuring and controlling non-sampling errors will 
be discussed in Chapter X. In this chapter, we shall give examples 
which show that the magnitude of non-sampling errors can some- 
times be much larger than what is commonly supposed, thus 
emphasising the need for reducing them as far as possible. 


Example 1.5 


This example is taken from an experiment conducted by the 
Indian Council of Agricultural Research at Poona (India) for 
evolving a method of obtaining a representative sample of fibres 
from a bulk of wool for the study of three wool characteristics, 
viz., length, fineness and medullation. The experiment consisted 
of preparing well mixed lots, each weighing about 0.6 gram, from
a commercial mass of wool. Each lot was combed out and spread 
on a velvet board in a uniform thin layer and divided into three 
approximately equal sections by referring to a scale placed across 
the fibres. The first section was used for drawing a sample of
200 individual fibres by method (a) which consisted of drawing 
individual fibres with the help of random numbers by reading on 
the scale placed across the fibres. The second section was used 
for drawing a sample by method (b) which consisted of drawing 
bunches of approximately 50 fibres each from 4 random positions 
of the spread-out wool. The third section was used for drawing 
bunches of 100, but we will not refer to the results of this section 
here. The fibres in the samples, as also those left behind in each 
section, were then measured for length. 

Table 1.2 gives the results of experiments on 24 lots measured
by the same observer. It will be seen that the sample estimates
based on individual fibres as sampling units exceed the corresponding
population values in all the 24 lots, although the difference
between the two has varied rather considerably from
lot to lot. The results show that there is a consistent
over-estimation in the method of sampling individual fibres. The results
of sampling by method (b), however, show that in 12 out of 24
lots, the sample estimate is larger than the corresponding popula- 
tion value, in 11 cases it is lower, and in the one remaining lot 
the two are equal, thus showing absence of bias in the method 
of sampling by bunches. It is clear that some conscious or 
unconscious tendency to select longer fibres of wool is introduced 
when method (a) of sampling is adopted. The procedure of 
random sampling implies identification of each one of the fibres 
in the population with the serial numbers 1 to N, and then 
selecting a sample of fibres with the help of random numbers. 
This, however, is an impracticable procedure to adopt in the
sampling of wool. A practicable procedure is the one that was
actually followed of spreading the lot in a thin layer across a 
velvet and selecting fibres from random positions with the help 
of a scale placed across it. The method, however, gives scope 
to the observer to select one fibre out of the several possible 
fibres in the neighbourhood of the random position. This scope, 
and with it the bias, is reduced when sampling is done by bunches. 


Example 1.6 


This is taken from an investigation carried out in Krishna
District of Madras State (India) for comparing the efficiency of
plots of different sizes in large-scale sample surveys for estimating
the average yield of paddy (Sukhatme, 1947). Thirty-six villages,
distributed equally among the 6 subdivisions of the district, were selected
for the investigation. In each selected village, 3 fields were selected
at random out of all the paddy-growing fields in the village, and
within each field the following plots were marked at random:


(a) A rectangle of 50 × 20 (links)² (area 435.6 sq.ft.), which is the
plot size adopted in official crop-sampling work in Madras;
(b) Two circles of radius 3 ft. each (area 28.3 sq.ft.); and
(c) Two circles of radius 2 ft. each (area 12.6 sq.ft.).
Besides, the whole of the remaining field was harvested. The
rectangular plot was marked with the help of tapes and pegs in 
the manner described in Example 1.4. The circular plots were 
marked with the help of a special apparatus devised for the 
purpose, consisting of a rotating peg, a steel tape and a plumb
line. The peg was made of wood and was provided with an 
iron collar at one end and a point at the other. It was fixed at 
a point in the field located by means of a pair of random numbers. 
The steel tape was so fixed as to revolve fully round the centre of 
the top of the peg. As the tape was revolved, the crop was cut 
from below the level of the tape, thus making room for the tape 
to move further until the original starting point was reached. 
To avoid trampling, the point located by means of a pair of 
random numbers was not taken as the centre of the circle but was 
taken as a point of its circumference on a line parallel to the length 
of the field. On arriving at this point, the worker was asked to cut 
the crop from this point along the direction of the length until he 
reached a distance slightly exceeding the radius from this point. 
From the starting point the worker then measured exactly a 
distance equal to the radius along the direction of the length 
of the field and fixed the peg at this point. This was the 
centre of the circle. The field work was carried out by the 
local staff of the Department of Agriculture who had been given 
thorough training prior to the commencement of the investigation. 
The results of the investigation are reproduced in Table 1.3. 


TABLE 1.3


Average Yield of Paddy in Lb./Acre for Plots of Different Sizes
Paddy Survey in Krishna District (Madras)


Size and Shape       Area in    No. of    Average Yield    Standard    Percentage
of Plot              Sq.ft.     Plots     in Lb./Acre      Error       Over-estimation

Whole field            ..        108        1939.2          107.3          ..
50 × 20 (links)²     435.60      108        1954.1          105.0          0.8
3' circle             28.27      216        2025.9          125.8          4.5
2' circle             12.57      216        2113.2          129.1          9.0

It is seen that while the yield estimate from the official plot size of 
50 links x 20 links is in close agreement with that from harvesting the 
whole field, those from small plots are considerable over-estimates. 
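
The degree of over-estimation can be recomputed directly from the average yields in Table 1.3, taking the whole-field yield as the reference. The short sketch below does this; the function and variable names are ours and are introduced only for illustration.

```python
# Average yields in lb. per acre, as read from Table 1.3.
whole_field_yield = 1939.2
plot_yields = {
    "50 x 20 links": 1954.1,
    "3 ft circle": 2025.9,
    "2 ft circle": 2113.2,
}


def percent_over_estimation(plot_yield, reference_yield):
    """Excess of the plot estimate over the reference, as a percentage of the reference."""
    return 100.0 * (plot_yield - reference_yield) / reference_yield


over_estimation = {
    name: round(percent_over_estimation(value, whole_field_yield), 1)
    for name, value in plot_yields.items()
}
```

The three percentages come out at 0.8, 4.5 and 9.0, so the over-estimation grows steadily as the plot becomes smaller.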

The instructions for locating the starting point and for marking 
of plot were as objective as they possibly could be. Nevertheless, 
anyone who has had experience of measuring the length of a field
and walking from a given point in the field along the direction
of its length will agree that the starting point of the plot and the 
direction along which it is to be laid in the field could at best 
be determined only approximately. Even if the same observer 
were to locate and mark the plot determined by a given pair of 
random numbers at different times in the same field, the plots 
may occupy different positions. The inclusion or exclusion of 
particular plants on the border of the plot in demarcating it will 
similarly depend upon the judgment of the experimenter. The 
area actually cut may also vary from the one intended to be cut 
due to unevenly sown crops and errors in measurement. If all 
these deviations could be ascribed to a random element, one would 
expect the errors to cancel out. The results given in Table 1.3, 
however, indicate that this is not the case. They show that small 
plots significantly over-estimate the yield, although the degree of 
over-estimation becomes smaller with larger plots. It is obvious 
that the overall influence of the various non-sampling errors 
relative to the produce harvested becomes smaller with the increase 
in plot size until, when the plot size is large enough, such as is 
used in official crop-sampling work, the bias becomes negligible. 

The above examples will show the need for exercising care in pre- 
paring the design of a survey so that, as far as possible, biases are 
absent. Where it is not practicable to ensure absence of bias, 
one should at least satisfy oneself that the bias present, if any, in the 
sample estimate is so small as to be negligible in comparison with 
its standard error. 
CHAPTER II
BASIC THEORY 


A. SIMPLE RANDOM SAMPLING 
2a.1 Introduction and Notation 


Simple random sampling is by far the most common method 
of sampling in surveys. It is operationally convenient and simple 
in theory, as the name suggests. We shall first present the basic 
theory of simple random sampling and then go on to consider 
briefly the theory of sampling in which the probabilities of 
selection are unequal. We shall give the theory in its application 
to both quantitative (i.e., measurable) and qualitative characters. 
We shall assume, unless otherwise mentioned, that the sampling
units are drawn without replacement.


Let
$N$ denote the number of sampling units in the population,
$y$ the character under consideration,
$y_i$ the value of the character for the i-th sampling unit of the population,
$\bar{y}_N$ the mean value of the character per unit of the population given by

$$\bar{y}_N = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{1}$$

$S^2$ the mean square for the population given by

$$S^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{y}_N)^2}{N-1} = \frac{\sum_{i=1}^{N} y_i^2 - N\bar{y}_N^2}{N-1} \tag{2}$$

$V(y)$ the variance of a single observation in the population given by

$$V(y) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y}_N)^2 \tag{3}$$

$$= \frac{N-1}{N}\, S^2 \tag{4}$$

$n$ the size, i.e., the number of sampling units in the sample,
$\bar{y}_n$ the sample mean given by

$$\bar{y}_n = \frac{1}{n}\sum_{i}^{n} y_i \tag{5}$$

and
$s^2$ the sample mean square given by

$$s^2 = \frac{\sum_{i}^{n} (y_i - \bar{y}_n)^2}{n-1} = \frac{\sum_{i}^{n} y_i^2 - n\bar{y}_n^2}{n-1} \tag{6}$$

where the summations extend over all the units in the sample.


2a.2 Unbiased Estimates of the Population Values 


An estimate will vary from sample to sample, depending upon
the units included in the sample. Thus, for the population
mentioned in Section 1.5, the sample means will be seen to vary
from 2.5 to 5.5 acres per farm. The sample mean square
will similarly be found to vary from 0.5 to 12.5, as shown in
Table 2.1. It will, however, be seen that the averages of the
sample means and sample mean squares over the totality of
samples are equal to the corresponding population values. Such
sample values are called unbiased estimates of the population
values. Algebraically, this is expressed as:

$$E(\bar{y}_n) = \frac{\sum \bar{y}_n}{\binom{N}{n}} = \bar{y}_N \tag{7}$$

$$E(s^2) = \frac{\sum s^2}{\binom{N}{n}} = S^2 \tag{8}$$

where the symbol $E$ stands, as usual, for expectation, and the sums
extend over all the $\binom{N}{n}$ possible samples. We write

$$\text{Est. } \bar{y}_N = \bar{y}_n \tag{9}$$

and

$$\text{Est. } S^2 = s^2 \tag{10}$$

but sometimes, when it is more convenient, we shall also use the
circumflex notation to denote the estimate, as $\hat{\bar{y}}_N$ and $\hat{S}^2$. It will
be shown in the following sections that, when a sample is selected 
by the method of simple random sampling, an unbiased estimate of 
the population mean is given by the sample mean and an unbiased 
estimate of the population mean square by the sample mean square. 


TABLE 2.1


Values of the Mean and the Mean Square in Different Samples
of Two from the Population Mentioned in Section 1.5


Serial No.   Values of
of the       Units in      ȳ_n     s²     ȳ_n − ȳ_N   (ȳ_n − ȳ_N)²   s² − S²   (s² − S²)²
Sample       the Sample

1            2, 3          2.5     0.5     −1.5         2.25         −25/6      625/36
2            2, 4          3.0     2.0     −1.0         1.00         −16/6      256/36
3            2, 7          4.5    12.5      0.5         0.25          47/6     2209/36
4            3, 4          3.5     0.5     −0.5         0.25         −25/6      625/36
5            3, 7          5.0     8.0      1.0         1.00          20/6      400/36
6            4, 7          5.5     4.5      1.5         2.25          −1/6        1/36

Total                     24.0    28.0      0           7.0            0       4116/36
Mean                       4.0     4.67     0           1.17           0        19.05
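
The entries of Table 2.1 can be checked by enumerating every sample of two. The sketch below assumes that the population of Section 1.5 consists of the four values 2, 3, 4 and 7, which is what the samples listed in the table imply; exact fractions are used so that the averages can be compared without rounding.

```python
from fractions import Fraction
from itertools import combinations

# Assumed population of Section 1.5 (acres per farm), as implied by Table 2.1.
population = [2, 3, 4, 7]


def mean(values):
    """Arithmetic mean as an exact fraction."""
    return Fraction(sum(values), len(values))


def mean_square(values):
    """Sum of squared deviations from the mean, divided by (number of values - 1)."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / (len(values) - 1)


samples = list(combinations(population, 2))  # the 6 possible samples of two
sample_means = [mean(s) for s in samples]
sample_mean_squares = [mean_square(s) for s in samples]

pop_mean = mean(population)                # population mean
pop_mean_square = mean_square(population)  # population mean square S^2

# Averages over the totality of samples.
avg_of_means = sum(sample_means) / len(samples)
avg_of_mean_squares = sum(sample_mean_squares) / len(samples)

# Average squared deviations of the estimates from the population values.
var_of_mean = sum((m - pop_mean) ** 2 for m in sample_means) / len(samples)
var_of_mean_square = sum(
    (q - pop_mean_square) ** 2 for q in sample_mean_squares
) / len(samples)
```

The averages reproduce the population values 4 and 14/3, and the two variances come out as 7/6 and 343/18, i.e. about 1.17 and 19.06 (the last row of the table prints the latter as 19.05).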
2a.3* Markoff's Theorem

In the sequel, unless otherwise stated, we shall consider only
linear combinations of the values for the units in the sample as
unbiased estimates of the population values. Among these, we
shall call that unbiased linear estimate the best which has the
minimum sampling variance. It can be shown by a corollary
of Markoff's theorem (Neyman and David, 1938) on unbiased
linear estimates that the best unbiased linear estimate of the
population mean $\bar{y}_N$ is given by that value of $\bar{y}_N$ for which
$u = \sum_{i=1}^{n} (y_i - \bar{y}_N)^2$ is minimum, if $y_1, y_2, \ldots, y_n$ are n observations
on n variates having the same variance and the same covariance
with means given by $E(y_i) = \bar{y}_N$. It is easily shown that $u$ is
minimum when $\bar{y}_N = \bar{y}_n$.


2a.4 Expected Value of the Sample Mean 


We write

$$E(\bar{y}_n) = E\left\{\frac{1}{n}\sum_{i}^{n} y_i\right\} \tag{11}$$

where $y_i$ stands for the value of the i-th unit of the population,
and the summation is taken over all the n units in the sample.
Numbering the units in the sample serially, as 1, 2, ..., r, ..., n,
we may write (11) as

$$E(\bar{y}_n) = E\left\{\frac{1}{n}\sum_{r=1}^{n} y_r\right\} \tag{12}$$

where $y_r$ now stands for the value of the unit included in the
sample at the r-th draw.

By a well-known theorem in probability, the expected value of
a sum is the sum of the expected values. We, therefore, write

$$E(\bar{y}_n) = \frac{1}{n}\left\{E(y_1) + E(y_2) + \cdots + E(y_r) + \cdots + E(y_n)\right\} \tag{13}$$

* This section may be skipped over at the first reading without losing continuity of the text.


24 SAMPLING THEORY OF SURVEYS WITH APPLICATIONS 


Now, by definition,

$$E(y_r) = \sum_{i=1}^{N} P_{ir}\, y_i \tag{14}$$

where $P_{ir}$ denotes the probability of drawing a specified unit $y_i$
at the r-th draw. We have seen in Section 1.5 that, in simple
random sampling, this probability is equal to 1/N. It follows,
therefore, that

$$E(y_r) = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}_N \qquad (r = 1, 2, \ldots, n) \tag{15}$$

On substituting from (15) in (13), we get

$$E(\bar{y}_n) = \bar{y}_N \tag{16}$$


An alternative and, in many ways, a more instructive approach
is to write the sample mean in the form:

$$\bar{y}_n = \frac{1}{n}\sum_{i=1}^{N} a_i y_i \tag{17}$$

where
$a_i = 1$ if $y_i$ is in the sample,
and
$a_i = 0$ otherwise.

From (17), taking the expected value of both sides, we write

$$E(\bar{y}_n) = \frac{1}{n}\sum_{i=1}^{N} E(a_i)\, y_i \tag{18}$$

Now, clearly,

$$E(a_i) = 1 \cdot \{\text{Probability that } y_i \text{ is included in the sample}\} = \frac{n}{N} \tag{19}$$

in virtue of Section 1.5.

Hence, substituting n/N for $E(a_i)$ in (18), we reach the same
result as (16).
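
The value n/N for the expected value of the indicator can be confirmed by counting. The following sketch, in which the function name and the unit labels 0, 1, ..., N − 1 are our own conventions, lists all the equally likely samples and finds the fraction that contain a specified unit.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_probability(N, n, unit=0):
    """Fraction of the C(N, n) equally likely samples that contain `unit`.

    Units are labelled 0, 1, ..., N - 1; this labelling is an assumption
    made for the illustration only.
    """
    samples = list(combinations(range(N), n))
    containing = sum(1 for sample in samples if unit in sample)
    return Fraction(containing, len(samples))
```

For example, `inclusion_probability(4, 2)` gives 1/2 and `inclusion_probability(7, 3, unit=5)` gives 3/7, in agreement with n/N.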


2a.5 Expected Value of the Sample Mean Square 
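By analogy with (14), we have

$$E(y_r^2) = \frac{1}{N}\sum_{i=1}^{N} y_i^2 \tag{20}$$

Adding and subtracting $\bar{y}_N^2$ from the right-hand side and using
(2), we can write (20) in an alternative form as:

$$E(y_r^2) = \bar{y}_N^2 + \left(1 - \frac{1}{N}\right) S^2 \tag{21}$$

It follows that

$$E\left\{\frac{1}{n}\sum_{r=1}^{n} y_r^2\right\} = \bar{y}_N^2 + \left(1 - \frac{1}{N}\right) S^2 \tag{22}$$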


Again, if $y_r'$ and $y_s'$ are used to denote the values of the units
drawn at the r-th and s-th draws respectively, say $y_i$ and $y_j$, we
have, by definition,

$$E(y_r' y_s') = \sum_{i \ne j = 1}^{N} y_i y_j\, P_{ir}\, P_{js \mid ir} \tag{23}$$

where $P_{js \mid ir}$ denotes the probability of drawing $y_j$ at the s-th draw,
given $y_i$.

Now, from Section 1.5, we have

$$P_{ir} = \frac{1}{N} \tag{24}$$

and, by an extension of the same result,

$$P_{js \mid ir} = \frac{1}{N-1} \tag{25}$$

Hence, substituting for $P_{ir}$ and $P_{js \mid ir}$ from (24) and (25) in (23),
we get

$$E(y_r' y_s') = \frac{1}{N(N-1)}\sum_{i \ne j = 1}^{N} y_i y_j \tag{26}$$

It follows that

$$E\left\{\frac{1}{n(n-1)}\sum_{r \ne s = 1}^{n} y_r' y_s'\right\} = \frac{1}{N(N-1)}\sum_{i \ne j = 1}^{N} y_i y_j \tag{27}$$
The result can be alternatively established as follows:
We have

$$E\left\{\frac{1}{n(n-1)}\sum_{i \ne j}^{n} y_i y_j\right\} = \frac{1}{n(n-1)}\sum_{i \ne j = 1}^{N} E(a_i a_j)\, y_i y_j \tag{28}$$

where the summation $\sum_{i \ne j}^{n}$ extends over the $n(n-1)$ product terms
$y_i y_j$ in the sample, and $E(a_i a_j)$ denotes the probability of including
$y_i$ and $y_j$ in the sample. Now,

$$E(a_i a_j) = E(a_i)\, E(a_j \mid a_i) \tag{29}$$

where $E(a_j \mid a_i)$ denotes the probability of including $y_j$ in a
sample of $(n-1)$, after $y_i$ is already drawn, from $(N-1)$.

Clearly, by the same argument by which we derived the value
of $E(a_i)$, we have

$$E(a_j \mid a_i) = \frac{n-1}{N-1} \tag{30}$$

It follows, therefore, that

$$E(a_i a_j) = \frac{n(n-1)}{N(N-1)} \tag{31}$$

Hence, on substituting from (31) in (28), we have

$$E\left\{\frac{1}{n(n-1)}\sum_{i \ne j}^{n} y_i y_j\right\} = \frac{1}{N(N-1)}\sum_{i \ne j = 1}^{N} y_i y_j \tag{32}$$
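which can, alternatively, be expressed as:

$$E\left\{\frac{1}{n(n-1)}\sum_{i \ne j}^{n} y_i y_j\right\} = \frac{1}{N(N-1)}\left\{\left(\sum_{i=1}^{N} y_i\right)^2 - \sum_{i=1}^{N} y_i^2\right\}$$

$$= \frac{1}{N(N-1)}\left\{N^2\bar{y}_N^2 - N\bar{y}_N^2 - (N-1)S^2\right\}$$

$$= \bar{y}_N^2 - \frac{S^2}{N} \tag{33}$$

or

$$E(y_r' y_s') = \bar{y}_N^2 - \frac{S^2}{N} \tag{34}$$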
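
The expected value of the product of the values at two different draws, namely the population mean squared less S²/N, can be checked on a small population. The sketch below assumes the four values 2, 3, 4 and 7 implied by Table 2.1 and averages the product over all the N(N − 1) equally likely ordered pairs of distinct units.

```python
from fractions import Fraction
from itertools import permutations

# Assumed population of Section 1.5, as implied by Table 2.1.
population = [2, 3, 4, 7]
N = len(population)

# Every ordered pair of distinct units is an equally likely outcome
# for the (r-th, s-th) draws under sampling without replacement.
ordered_pairs = list(permutations(population, 2))
expected_product = Fraction(sum(a * b for a, b in ordered_pairs), len(ordered_pairs))

pop_mean = Fraction(sum(population), N)
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)

# Right-hand side: population mean squared minus S^2 / N.
formula_value = pop_mean ** 2 - S2 / N
```

Both quantities equal 89/6 for this population.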


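Finally, using the results in (22) and (33), we have

$$E(\bar{y}_n^2) = E\left\{\frac{1}{n^2}\left(\sum_{i}^{n} y_i^2 + \sum_{i \ne j}^{n} y_i y_j\right)\right\}$$

$$= \frac{1}{n}\left\{\bar{y}_N^2 + \left(1 - \frac{1}{N}\right)S^2\right\} + \frac{n-1}{n}\left\{\bar{y}_N^2 - \frac{S^2}{N}\right\}$$

$$= \bar{y}_N^2 + \left(\frac{1}{n} - \frac{1}{N}\right)S^2 \tag{35}$$

It follows that

$$E(s^2) = E\left\{\frac{1}{n-1}\left(\sum_{i}^{n} y_i^2 - n\bar{y}_n^2\right)\right\}$$

$$= \frac{1}{n-1}\left\{E\left(\sum_{i}^{n} y_i^2\right) - n\,E(\bar{y}_n^2)\right\}$$

$$= \frac{1}{n-1}\left[n\left\{\bar{y}_N^2 + \left(1 - \frac{1}{N}\right)S^2\right\} - n\left\{\bar{y}_N^2 + \left(\frac{1}{n} - \frac{1}{N}\right)S^2\right\}\right]$$

$$= S^2 \tag{36}$$

showing that $s^2$ is an unbiased estimate of $S^2$.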


2a.6 Sampling Variance, Standard Error and Mean Square Error 


We have seen that a sample estimate differs from the population 
value by varying magnitudes in different samples. This difference 
between the sample estimate and the population value is called 
the sampling error of the estimate. An important requirement 
of a sampling method is that, in addition to giving an estimate 
of the population value, it should provide a measure of the 
sampling error in the estimate. Since the actual sampling error 
in an estimate cannot be known, we obtain a measure of the 
average magnitude over all possible samples of the sampling 
error in the estimate. A simple average of the actual sampling 
errors over all possible samples is, however, zero in the case of 
unbiased estimates, as seen from Table 2.1. An average of the 
sampling error without regard to sign provides one measure, 
called the mean deviation, but this is not in common use. The 
average magnitude of the squares of sampling errors over all 
possible samples is called the sampling variance of the estimate 
and its square-root is the measure most commonly used for 
defining the average sampling error. This measure of the average 
sampling error is called the standard error. In defining the standard 
error as above, we assume that the sample estimate provides an 
unbiased estimate of the population value. More generally, for 
a biased estimate of the population value, the sampling variance 
is defined as the arithmetic mean of the squares of the differences 
between the sample estimate and the expected value of the 
estimate over all samples, and its square root is called the 
standard error. The arithmetic mean of the squares of the differ- 
ences between the sample estimate and the population value, in 
this case, is called the mean square error. 


2a.7 Sampling Variance of the Mean 


Let $V(\bar{y}_n)$ denote the sampling variance of the mean. Then,
by definition, we have

$$V(\bar{y}_n) = E\{\bar{y}_n - E(\bar{y}_n)\}^2 = E(\bar{y}_n^2) - \{E(\bar{y}_n)\}^2 \tag{37}$$

Substituting from (16) and (35), we obtain

$$V(\bar{y}_n) = \left(\frac{1}{n} - \frac{1}{N}\right)S^2 \tag{38}$$

which can also be written as

$$V(\bar{y}_n) = \frac{N-n}{N} \cdot \frac{S^2}{n} \tag{39}$$


The reader may verify that the value of the sampling variance 
derived from this formula is the value actually obtained in 
Table 2.1 for a sample of size 2. 


The factor (N — n)/N in (39) is a correction for the finite size 
of the population and is called the finite population correction 
factor or simply the finite multiplier. When n is small as compared 
with N, the multiplier will approach unity and the sampling 
variance of the mean will approximate to that for the mean of 
a sample drawn from an infinite population. 
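
Formula (39), the finite multiplier (N − n)/N applied to S²/n, is easily put into a small function. The sketch below is ours; it uses exact fractions and is checked against the value 7/6 obtained in Table 2.1 for samples of two from a population of four with S² = 14/3.

```python
from fractions import Fraction


def variance_of_mean(S2, n, N):
    """Sampling variance of the mean of a simple random sample of n units
    drawn without replacement from N units: (N - n)/N times S2/n."""
    finite_multiplier = Fraction(N - n, N)
    return finite_multiplier * Fraction(S2) / n


# Table 2.1: N = 4, n = 2, S^2 = 14/3.
table_value = variance_of_mean(Fraction(14, 3), 2, 4)

# With N very large the multiplier is close to unity and the variance
# approaches S^2/n, the value for an infinite population.
large_population_value = variance_of_mean(Fraction(14, 3), 2, 10 ** 6)
```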

Usually, the value of $S^2$ will not be known. Its estimate from
the sample will, therefore, have to be used in calculating the
sampling variance. Thus

$$\text{Est. } V(\bar{y}_n) = \frac{N-n}{Nn}\, s^2 \tag{40}$$

and the estimate of the standard error is given by

$$\text{Est. S.E.}(\bar{y}_n) = \sqrt{\frac{N-n}{Nn}}\; s \tag{41}$$

This estimate, however, as will be clear from Section 2a.11, has
a slight negative bias.
It is instructive to derive the above result using the known
result for the infinite population. Let $\mu$ and $\mu_2$ represent the
mean and the variance for the infinite population, so that

$$E(y) = \mu$$

and

$$E(y - \mu)^2 = \mu_2$$

We may write

$$(\bar{y}_n - \mu) = (\bar{y}_n - \bar{y}_N) + (\bar{y}_N - \mu)$$

Squaring both sides, we have

$$(\bar{y}_n - \mu)^2 = (\bar{y}_n - \bar{y}_N)^2 + (\bar{y}_N - \mu)^2 + 2(\bar{y}_n - \bar{y}_N)(\bar{y}_N - \mu)$$

If we now regard the finite population N itself as a random
sample from the infinite population and, for any given N, choose
n at random out of N, then the sample of n also becomes a
random sample from the infinite population. Taking expectations
of both sides in two stages, first for fixed $y_1, y_2, \ldots, y_N$ and then
over all possible samples of N from the infinite population, we
write

$$E\left[E\{(\bar{y}_n - \mu)^2 \mid y_1, y_2, \ldots, y_N\}\right] = E\{V(\bar{y}_n) \mid y_1, y_2, \ldots, y_N\} + E(\bar{y}_N - \mu)^2 + 2E\left[(\bar{y}_N - \mu)\, E\{(\bar{y}_n - \bar{y}_N) \mid y_1, y_2, \ldots, y_N\}\right]$$

But we have already seen in (16) that $E(\bar{y}_n) = \bar{y}_N$, and so
$E\{(\bar{y}_n - \bar{y}_N) \mid y_1, \ldots, y_N\} = 0$. The last term in the above expression
is therefore zero. Hence

$$\frac{\mu_2}{n} = E\{V(\bar{y}_n) \mid y_1, \ldots, y_N\} + \frac{\mu_2}{N}$$

or

$$E\{V(\bar{y}_n) \mid y_1, \ldots, y_N\} = \left(\frac{1}{n} - \frac{1}{N}\right)\mu_2$$

It follows that

$$V(\bar{y}_n) \mid y_1, \ldots, y_N = \left(\frac{1}{n} - \frac{1}{N}\right) \text{Est. } \mu_2$$

But the estimate of $\mu_2$ in a sample of N is provided by $S^2$. Hence

$$V(\bar{y}_n) \mid y_1, \ldots, y_N = \left(\frac{1}{n} - \frac{1}{N}\right) S^2$$
2a.8* Partitional Notation


An expression for the sampling variance of s? can be similarly 
derived. The calculation, however, involves much heavier algebra 
than in the case of the mean. The calculation of the sampling 
variance of the higher order moments is even more laborious. 
Their derivation is greatly facilitated by the use of partitional 
notation. We shall therefore digress to introduce this notation here 
and illustrate its application to derive the sampling variance of s².


A partition of a number w is a collection of positive integers 
1 to 9 whose sum is equal to w. The integers are written in a 
descending order of magnitude and enclosed in brackets ( ).
A partition which has p parts is called a p-part partition and the 
number w is called the weight of the partition. A repetition of 
the same part is indicated by exponents. 


Examples:


(i) (2) and (1²) are 1-part and 2-part partitions of 2 respectively.


(ii) (3), (2 1) and (1³) are 1-part, 2-part and 3-part partitions
of 3 respectively.


(iii) In general, $(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})$ is a p-part partition of the
number w, where $p_i$ is repeated $\pi_i$ times $(i = 1, 2, \ldots, h)$,

$$p = \sum_{i=1}^{h} \pi_i \quad \text{and} \quad w = \sum_{i=1}^{h} p_i \pi_i .$$
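
The partitions of a given weight are readily listed by machine. The generator below is a minimal sketch of our own (it is not taken from the text): each partition is returned as a tuple of parts in descending order, so that (2, 1, 1) stands for the partition written (2 1²) above.

```python
def partitions(w, largest=None):
    """Yield the partitions of the weight w as tuples of parts in
    descending order of magnitude, e.g. (2, 1, 1) for (2 1^2)."""
    if largest is None:
        largest = w
    if w == 0:
        yield ()
        return
    # Choose the leading part, then partition what remains using
    # parts no larger than the one just chosen.
    for part in range(min(w, largest), 0, -1):
        for rest in partitions(w - part, part):
            yield (part,) + rest


partitions_of_2 = list(partitions(2))
partitions_of_3 = list(partitions(3))
```

The number of parts of a partition is simply the length of the tuple, and its weight is the sum of the parts.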
The functions of observations we require are the monomial
symmetric functions, called the g-functions in the notation of this
book, and are written as $g(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})$. Thus $\sum_{i} y_i^3$ will be
denoted by $g(3)$, $\sum_{i \ne j} y_i^2 y_j$ by $g(2\,1)$, and $\sum_{i \ne j \ne k} y_i y_j y_k$ by
$3!\, g(1^3)$. In general, the monomial symmetric function

$$\sum_{i \ne j \ne \cdots \ne k \ne l \ne \cdots} y_i^{p_1} y_j^{p_1} \cdots y_k^{p_2} y_l^{p_2} \cdots$$

will be denoted by

$$\pi_1!\, \pi_2! \ldots \pi_h!\; g(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h}) \tag{42}$$
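The special case of the g-function involving one-part partitions and
thus denoting sums of powers of observations is called the
s-function, and is denoted by replacing g by s, as $s(p)$, or sometimes
by putting p as a suffix, e.g., $s_p$. A product of two or more
s-functions like $s_{p_1}, s_{p_2}, \ldots, s_{p_h}$ is written as $s(p_1 p_2 \ldots p_h)$, where
$(p_1 p_2 \ldots p_h)$ is an h-part partition of the number $p_1 + p_2 + \ldots + p_h$.


Examples:

$$s(3) = \sum_{i=1}^{n} y_i^3$$

$$s(2\,1) = \left(\sum_{i=1}^{n} y_i^2\right)\left(\sum_{i=1}^{n} y_i\right)$$

$$s(1^3) = s_1^3 = \left\{\sum_{i=1}^{n} y_i\right\}^3$$

In general,

$$s(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h}) = \{s(p_1)\}^{\pi_1} \{s(p_2)\}^{\pi_2} \ldots \{s(p_h)\}^{\pi_h}$$

$$= \left(\sum y_i^{p_1}\right)^{\pi_1} \left(\sum y_i^{p_2}\right)^{\pi_2} \ldots \left(\sum y_i^{p_h}\right)^{\pi_h}$$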

We shall use capital symbols G and 5 to denote functions of 
observations for the population; small symbols, g and s, denoting 
as above the corresponding functions for the sample. 


The g-functions can be expressed as linear functions of the
products of s-functions, and vice versa, by means of the following
identity:

$$-\sum_{i} \log(1 - a y_i) = s_1 a + \frac{s_2 a^2}{2} + \frac{s_3 a^3}{3} + \cdots$$

where a is any constant such that

$$|a| < \frac{1}{\max |y_i|}$$

giving us

$$g(p_1^{\pi_1} p_2^{\pi_2} \ldots) = \sum g_s(P, Q)\; s(q_1^{\chi_1} q_2^{\chi_2} \ldots) \tag{43}$$

and

$$s(p_1^{\pi_1} p_2^{\pi_2} \ldots) = \sum s_g(P, Q)\; g(q_1^{\chi_1} q_2^{\chi_2} \ldots) \tag{44}$$

where P and Q stand for the partitions $(p_1^{\pi_1} p_2^{\pi_2} \ldots)$ and
$(q_1^{\chi_1} q_2^{\chi_2} \ldots)$ of the same number w given by

$$w = \sum p_i \pi_i = \sum q_i \chi_i \tag{45}$$

$\sum \pi_i$ and $\sum \chi_i$ denote the numbers of parts in the partitions
P and Q respectively, $g_s(P, Q)$ denotes the coefficient of
$s(q_1^{\chi_1} q_2^{\chi_2} \ldots)$ in the expansion of $g(p_1^{\pi_1} p_2^{\pi_2} \ldots)$, $s_g(P, Q)$
denotes the coefficient of $g(q_1^{\chi_1} q_2^{\chi_2} \ldots)$ in the expansion of
$s(p_1^{\pi_1} p_2^{\pi_2} \ldots)$, and (43) and (44) are summed over all the
different partitions Q of w, including P.
Examples:

$$g(1^2) = -\tfrac{1}{2} s_2 + \tfrac{1}{2} s_1^2$$

$$g(2\,1^2) = s_4 - s_3 s_1 - \tfrac{1}{2} s_2^2 + \tfrac{1}{2} s_2 s_1^2$$

$$s(2^2) = g(4) + 2g(2^2)$$

$$s(3\,2\,1) = g(6) + g(5\,1) + g(4\,2) + 2g(3^2) + g(3\,2\,1)$$

Values of the coefficients $s_g(P, Q)/\chi_1!\, \chi_2! \ldots$ and $\pi_1!\, \pi_2! \ldots g_s(P, Q)$
for weights less than or equal to 8 have been tabulated
(Sukhatme, 1938) and reproduced in the Appendix to this chapter.


Expansions of g-functions in terms of s-functions, and vice versa,
can be readily written down with the help of these tables.
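
The expansions quoted in the examples can be verified numerically on any set of observations. The sketch below is an illustration under our own conventions: a partition is passed as a tuple of parts, `s` forms the product of power sums, and `g` evaluates the monomial symmetric function by summing over ordered sets of distinct units and dividing by the product of the factorials of the multiplicities, as in (42). The observations 2, 3, 4 and 7 are those implied by Table 2.1.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial


def s(parts, y):
    """Product of power sums: s(p1 p2 ...) = (sum y^p1)(sum y^p2)..."""
    result = 1
    for p in parts:
        result *= sum(v ** p for v in y)
    return result


def g(parts, y):
    """Monomial symmetric function g(p1 p2 ...) of the observations y."""
    # Sum y_i^p1 * y_j^p2 * ... over ordered sets of distinct units.
    total = 0
    for units in permutations(range(len(y)), len(parts)):
        term = 1
        for i, p in zip(units, parts):
            term *= y[i] ** p
        total += term
    # Repeated parts make each distinct monomial appear pi_1! pi_2! ... times.
    divisor = 1
    for multiplicity in Counter(parts).values():
        divisor *= factorial(multiplicity)
    return Fraction(total) / divisor


y = [2, 3, 4, 7]
s1, s2, s3, s4 = (s((k,), y) for k in (1, 2, 3, 4))
```

The brute-force evaluation in `g` is practicable only for small samples; its purpose here is to check the tabulated coefficients, not to replace them.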


2a.9* A Theorem on Expected Values of Symmetric Functions
We shall now state and prove the following theorem relating 

to the expected value of a g-function in samples selected by the 

method of simple random sampling from a finite population. 


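Theorem. The expected value of a monomial symmetric function
of the sample observations is $e_p$ times the corresponding
function for the population, where

$$e_p = \frac{n(n-1)\ldots(n-p+1)}{N(N-1)\ldots(N-p+1)} \tag{46}$$

that is,

$$E\{g(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})\} = e_p\, G(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h}) \tag{47}$$

Proof:

$$E\{g(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})\} = \frac{1}{\pi_1!\, \pi_2! \ldots \pi_h!}\, E\left\{\sum_{\ne}^{n} y_{i_1}^{p_1} y_{i_2}^{p_1} \cdots y_{i_p}^{p_h}\right\}$$

$$= \frac{1}{\pi_1!\, \pi_2! \ldots \pi_h!}\, E\left\{\sum_{\ne}^{N} a\; y_{i_1}^{p_1} y_{i_2}^{p_1} \cdots y_{i_p}^{p_h}\right\}$$

where the sums extend over sets of p different units in the sample and
in the population respectively, and a is a simplified notation for the product

$$a_{i_1} a_{i_2} \cdots a_{i_p}$$

so that

a = 1, if $y_{i_1}, y_{i_2}, \ldots, y_{i_p}$ are in the sample,

and
a = 0 otherwise.
Hence

$$E\{g(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})\} = \frac{1}{\pi_1!\, \pi_2! \ldots \pi_h!} \sum_{\ne}^{N} E(a)\; y_{i_1}^{p_1} y_{i_2}^{p_1} \cdots y_{i_p}^{p_h}$$

But

$$E(a) = 1 \cdot \text{Probability}\,(a = 1) = \frac{n(n-1)\ldots(n-p+1)}{N(N-1)\ldots(N-p+1)} = e_p$$

Hence

$$E\{g(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})\} = E(a)\, G(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h}) = e_p\, G(p_1^{\pi_1} p_2^{\pi_2} \ldots p_h^{\pi_h})$$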
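
The theorem can be tried out on a two-part function. The sketch below takes g(2 1), the sum of $y_i^2 y_j$ over pairs of different units, evaluates it for every sample of two from the assumed population 2, 3, 4, 7 implied by Table 2.1, and compares the average with $e_2$ times the same function for the population. The helper names are ours.

```python
from fractions import Fraction
from itertools import combinations, permutations


def e(p, n, N):
    """e_p = n(n-1)...(n-p+1) / N(N-1)...(N-p+1)."""
    numerator = denominator = 1
    for k in range(p):
        numerator *= n - k
        denominator *= N - k
    return Fraction(numerator, denominator)


def g21(values):
    """g(2 1): sum of y_i^2 * y_j over ordered pairs of different units."""
    return sum(a * a * b for a, b in permutations(values, 2))


population = [2, 3, 4, 7]  # assumed, as implied by Table 2.1
N, n = len(population), 2

samples = list(combinations(population, n))
expected_g21 = Fraction(sum(g21(sample) for sample in samples), len(samples))
theorem_value = e(2, n, N) * g21(population)
```

Both sides equal 403/3 here. Note also that $e_p$ vanishes when p exceeds n, since a sample of n units cannot contain p different units.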
2a.10* Sampling Variance of s²


By definition,

$$V(s^2) = E\{(s^2)^2\} - \{E(s^2)\}^2 = E(s^4) - S^4 \tag{48}$$
Now

$$(n-1)^2 (s^2)^2 = \left[\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right]^2 = s(2^2) - \frac{2}{n}\, s(2\,1^2) + \frac{1}{n^2}\, s(1^4) \tag{49}$$

A reference to the $s_g(P, Q)/\chi_1!\, \chi_2! \ldots$ table of weight 4 gives

$$(n-1)^2 (s^2)^2 = \{g(4) + 2g(2^2)\} - \frac{2}{n}\{g(4) + 2g(3\,1) + 2g(2^2) + 2g(2\,1^2)\} + \frac{1}{n^2}\{g(4) + 4g(3\,1) + 6g(2^2) + 12g(2\,1^2) + 24g(1^4)\}$$

whence, taking expectations of both sides and using the theorem
of the previous section, we have

$$(n-1)^2 E(s^4) = \frac{(n-1)^2}{n^2}\, e_1 G(4) - \frac{4(n-1)}{n^2}\, e_2 G(3\,1) + \frac{2(n^2 - 2n + 3)}{n^2}\, e_2 G(2^2) - \frac{4(n-3)}{n^2}\, e_3 G(2\,1^2) + \frac{24}{n^2}\, e_4 G(1^4) \tag{50}$$


п“ 


It is customary to express the expected values in terms of the
population moments about the mean. This is done with the
help of the table for $\pi_1!\, \pi_2! \ldots g_s(P, Q)$ in the Appendix. We
may, for simplicity, assume without loss of generality that the
population mean, or $S_1$, is zero.

Substituting for the G functions the S functions, by using the
aforesaid table of weight 4, we obtain

$$(n-1)^2 E(s^4) = \frac{(n-1)^2}{n^2}\, e_1 S_4 + \frac{4(n-1)}{n^2}\, e_2 S_4 + \frac{n^2 - 2n + 3}{n^2}\, e_2 (S_2^2 - S_4) - \frac{2(n-3)}{n^2}\, e_3 (2S_4 - S_2^2) + \frac{1}{n^2}\, e_4 (-6S_4 + 3S_2^2) \tag{51}$$
Collecting terms and remembering that $S_4 = N\mu_4$, $S_2 = N\mu_2$ and
$S^2 = N\mu_2/(N-1)$, we have

$$V(s^2) = \frac{1}{(n-1)^2}\left[N\mu_4\left\{(e_1 - e_2) - \frac{2(e_1 - 3e_2 + 2e_3)}{n} + \frac{e_1 - 7e_2 + 12e_3 - 6e_4}{n^2}\right\} + N^2\mu_2^2\left\{e_2 - \frac{2(e_2 - e_3)}{n} + \frac{3(e_2 - 2e_3 + e_4)}{n^2}\right\}\right] - \frac{N^2\mu_2^2}{(N-1)^2} \tag{52}$$

It can be verified that the value 19.05 for the sampling variance
of s² obtained in Table 2.1 agrees with that derived from the
above formula.
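
The verification suggested above can be carried out as follows. The sketch evaluates formula (52) in exact fractions for the population 2, 3, 4, 7 (the values implied by Table 2.1; an assumption as far as Section 1.5 is concerned). The function names are ours.

```python
from fractions import Fraction


def e(p, n, N):
    """e_p = n(n-1)...(n-p+1) / N(N-1)...(N-p+1)."""
    numerator = denominator = 1
    for k in range(p):
        numerator *= n - k
        denominator *= N - k
    return Fraction(numerator, denominator)


def variance_of_s2(population, n):
    """Sampling variance of s^2 in simple random samples of n, by formula (52)."""
    N = len(population)
    mean = Fraction(sum(population), N)
    mu2 = sum((y - mean) ** 2 for y in population) / N  # second moment about the mean
    mu4 = sum((y - mean) ** 4 for y in population) / N  # fourth moment about the mean
    e1, e2, e3, e4 = [e(p, n, N) for p in (1, 2, 3, 4)]
    nf = Fraction(n)
    mu4_coefficient = (
        (e1 - e2)
        - 2 * (e1 - 3 * e2 + 2 * e3) / nf
        + (e1 - 7 * e2 + 12 * e3 - 6 * e4) / nf ** 2
    )
    mu2_coefficient = e2 - 2 * (e2 - e3) / nf + 3 * (e2 - 2 * e3 + e4) / nf ** 2
    expected_s4 = (N * mu4 * mu4_coefficient + N ** 2 * mu2 ** 2 * mu2_coefficient) / (nf - 1) ** 2
    return expected_s4 - (N * mu2 / (N - 1)) ** 2  # E(s^4) - S^4


value_for_samples_of_two = variance_of_s2([2, 3, 4, 7], 2)
```

For samples of two the formula gives 343/18, or about 19.06, which is the average of the last column of Table 2.1; for samples of three from the same population it gives 49/9.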


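The corresponding formula when N is infinite is readily obtained
as the limiting case as $N \to \infty$. In this case, we have

$$N^i e_j = 0 \quad \text{for } i < j$$

and

$$N^j e_j = n(n-1)\ldots(n-j+1)$$

Hence

$$V(s^2) = \frac{\mu_4 - \mu_2^2}{n} + \frac{2}{n(n-1)}\, \mu_2^2 \tag{53}$$

Using the Pearsonian notation for departure from normality, this
can be written as

$$V(s^2) = \mu_2^2\left[\frac{\beta_2 - 1}{n} + \frac{2}{n(n-1)}\right] \tag{54}$$

where

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

For the normal population $\beta_2 = 3$, so that

$$V(s^2) = \frac{2\mu_2^2}{n-1} \tag{55}$$

and

$$\text{S.E.}(s^2) = \mu_2 \sqrt{\frac{2}{n-1}} \tag{56}$$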
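
Formula (53) and its reduction for the normal population can be checked with a few lines. In the sketch below the function name is ours, and the numerical values of the moments are arbitrary illustrations chosen so that the fourth moment is three times the square of the second, as for the normal law.

```python
from fractions import Fraction


def variance_of_s2_infinite(mu2, mu4, n):
    """Sampling variance of s^2 for samples of n from an infinite population:
    (mu4 - mu2^2)/n + 2*mu2^2 / (n(n - 1))."""
    return (mu4 - mu2 ** 2) / n + 2 * mu2 ** 2 / (n * (n - 1))


# Illustrative values only: mu2 = 2 and mu4 = 3 * mu2^2 = 12, with n = 10.
mu2 = Fraction(2)
mu4 = 3 * mu2 ** 2
n = 10
general_value = variance_of_s2_infinite(mu2, mu4, n)
normal_value = 2 * mu2 ** 2 / (n - 1)
```

Both expressions give 8/9, confirming that the general formula reduces to twice the squared variance divided by (n − 1) when the fourth moment takes its normal value.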
Expected values of higher order sample moments and their
products have been worked out by adopting the above procedure 
and tabulated for ready reference by Sukhatme (1944). 


2a.11 Expected Value and Sampling Variance of s 


We have seen that the sample mean square $s^2$ provides an
unbiased estimate of $S^2$, and that, in samples from large
populations, its sampling variance is given by

$$V(s^2) = \frac{\mu_4 - \mu_2^2}{n} + \frac{2}{n(n-1)}\, \mu_2^2$$

where $\mu_2$ and $\mu_4$ denote the second and the fourth moments of the
population. However, sometimes, we also need to know the
behaviour of s, the standard deviation. This is obtained as follows:

Let

$$s^2 = S^2 + \epsilon$$

where

$$E(\epsilon) = 0$$

and

$$E(\epsilon^2) = V(s^2)$$

We may write

$$s = (S^2 + \epsilon)^{1/2} = S\left(1 + \frac{\epsilon}{S^2}\right)^{1/2}$$

Since $\epsilon$ will be small as compared with $S^2$ with a probability
approaching 1 as n becomes large, we may expand the right-hand
side as a series, neglecting powers of $\epsilon$ higher than the second.
We then have

$$s = S\left[1 + \frac{1}{2}\,\frac{\epsilon}{S^2} - \frac{1}{8}\left(\frac{\epsilon}{S^2}\right)^2 + \cdots\right]$$

Taking expectations of both sides, we obtain

$$E(s) = S\left\{1 - \frac{V(s^2)}{8S^4}\right\} \tag{57}$$
The result shows that s will under-estimate S although, if n is
large, the bias will be negligible.
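
The direction of the bias can be seen even in the small example of Table 2.1. The sketch below assumes the population 2, 3, 4, 7 implied by that table, averages s over the six samples of two, and sets beside it the approximation S{1 − V(s²)/8S⁴}. With n as small as 2 the approximation cannot be expected to be close; the example is meant only to show that both values fall below S.

```python
import math
from itertools import combinations

population = [2, 3, 4, 7]  # assumed, as implied by Table 2.1
n = 2


def mean_square(values):
    """Sum of squared deviations from the mean, divided by (number of values - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


S2 = mean_square(population)
S = math.sqrt(S2)

sample_s2 = [mean_square(sample) for sample in combinations(population, n)]

# Exact expected value of s: the average of s over all possible samples.
exact_mean_of_s = sum(math.sqrt(v) for v in sample_s2) / len(sample_s2)

# Approximation: S * (1 - V(s^2) / (8 S^4)), with V(s^2) found by enumeration.
variance_of_s2 = sum((v - S2) ** 2 for v in sample_s2) / len(sample_s2)
approx_mean_of_s = S * (1 - variance_of_s2 / (8 * S2 ** 2))
```

Here S is about 2.16, the exact average of s is about 1.89 and the approximation gives about 1.92.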


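Turning now to the evaluation of the sampling variance of s,
we have

$$V(s) = E\{s - E(s)\}^2 = E(s^2) - \{E(s)\}^2$$

$$= S^2 - S^2\left\{1 - \frac{V(s^2)}{8S^4}\right\}^2$$

$$= S^2\left\{1 - 1 + \frac{V(s^2)}{4S^4}\right\}$$

$$= \frac{V(s^2)}{4S^2} \tag{58}$$

where $V(s^2)$ denotes the variance of $s^2$ to terms up to 1/n only.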
2a.12 Confidence Limits 


The standard error gives an idea of the frequency with which 
errors (differences between the sample estimate and the population 
value) of a given magnitude may be expected to occur if repeated 
random samples of the same size are drawn from the population. 
Usually errors smaller than the standard error will occur with 
a frequency of about 68%, and those smaller than twice the 
magnitude of the standard error will occur with a frequency of 
about 95%, provided the estimate is approximately normally 
distributed. In general, if the sample size is not too small and 
N is large and if the estimate under consideration is a linear 
unbiased estimate of the population value, then the frequency with 
which errors will exceed a fixed multiple of the standard error of 
the estimate is approximately equal to the frequency as determined 
by the normal law. Consequently, from a knowledge of the 
standard error of the estimate and with the help of the normal 
probability integral tables, we are in a position to locate the 
actual unknown population value within certain limits with a 
known relative frequency. To take the example of estimating the 
population mean, we know that the mean of a random sample 
will be approximately normally distributed if the size of the 
sample is not too small and if the population from which it is 
drawn is not very different from the normal. We may, therefore,
expect that

$$|\bar{y}_n - \bar{y}_N| \le \sqrt{\frac{N-n}{Nn}}\; S \qquad (59)$$

on an average in 68 out of 100 occasions, and

$$|\bar{y}_n - \bar{y}_N| \le 2\sqrt{\frac{N-n}{Nn}}\; S \qquad (60)$$

on an average with a frequency of about 95 out of 100. In
general, we can expect the inequality

$$|\bar{y}_n - \bar{y}_N| \le t_{(\alpha,\infty)}\sqrt{\frac{N-n}{Nn}}\; S \qquad (61)$$

where $t_{(\alpha,\infty)}$ is the value of the normal variate corresponding to
the value $1 - \alpha/2$ of the normal probability integral, to hold on
an average with a probability $1 - \alpha$. The two limits, on either
side of the population mean in (61), are called the confidence
limits and the interval between them the confidence interval. The
probability with which the inequality holds, viz., $1 - \alpha$, is termed
the confidence coefficient.
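The limits in (61) can be sketched in a few lines of Python. This is only an illustration: the function name and the numerical inputs are hypothetical, and the normal deviate is taken from the standard library rather than from printed tables.

```python
from math import sqrt
from statistics import NormalDist

def normal_confidence_limits(ybar, S, N, n, alpha):
    """Limits ybar -/+ t_(alpha, inf) * sqrt((N - n) / (N n)) * S, as in (61)."""
    t = NormalDist().inv_cdf(1 - alpha / 2)   # normal deviate for 1 - alpha/2
    half_width = t * sqrt((N - n) / (N * n)) * S
    return ybar - half_width, ybar + half_width

# Hypothetical figures, not taken from the text.
lower, upper = normal_confidence_limits(ybar=50.0, S=12.0, N=1000, n=100, alpha=0.05)
```

The interval is symmetric about the sample mean and shrinks as $n$ approaches $N$, because the factor $(N - n)/(Nn)$ then tends to zero.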

It should be noted that the confidence limits may vary from 
sample to sample. Thus the confidence limits for the six different 
samples mentioned in Section 1.5 at the 68% and 95% confidence 
coefficients work out as shown in cols. 2 and 4 of Table 2.2. 


TABLE 2.2
Confidence Limits for Different Samples Mentioned in Table 2.1

              Confidence Limits            Confidence Limits
Sample         (1 - α = .68)                (1 - α = .95)
  No.
          Based on S    Based on s     Based on S    Based on s

  (1)        (2)           (3)            (4)            (5)
   1       1.4, 3.6      1.8, 3.2       0.3, 4.7      -2.0,  7.0
   2       1.9, 4.1      1.7, 4.3       0.8, 5.2      -6.0, 12.0
   3       3.4, 5.6      1.2, 7.8       2.3, 6.7     -18.0, 27.0
   4       2.4, 4.6      2.8, 4.2       1.3, 5.7      -1.0,  8.0
   5       3.9, 6.1      2.4, 7.6       2.8, 7.2     -13.0, 23.0
   6       4.4, 6.6      3.5, 7.5       3.3, 7.7      -8.0, 19.0

It will be observed that in four out of six cases, the population
mean is contained within the confidence limits given in col. 2, 
while in all the six cases it is contained within the limits shown 
in col. 4, as is to be expected. The result is of course fortuitous 
in view of the small size of the population but it serves to demon- 
strate the meaning of the inequalities above. 


When $S^2$ is not known, we use its estimate $s^2$ obtained from
the sample. The statement in (61) with $S^2$ replaced by its estimate
$s^2$ will, however, no longer be exact. To obtain the confidence
limits in this case, we make use of the result that
$(\bar{y}_n - \bar{y}_N)/\text{Est. S.E.}(\bar{y}_n)$
is distributed as Student's $t$ with $(n - 1)$ degrees of freedom when
$n$ is not too small and the original distribution is not far removed
from the normal. If we denote by $t_{(\alpha, n-1)}$ the value of $t$
corresponding to the level of significance $\alpha$ for $(n - 1)$ degrees of
freedom, it follows that we may expect the inequality

$$\frac{|\bar{y}_n - \bar{y}_N|}{\sqrt{\dfrac{N-n}{Nn}}\; s} \le t_{(\alpha, n-1)} \qquad (62)$$

to hold on the average with probability $(1 - \alpha)$. The $(1 - \alpha)$
confidence limits when the size of the sample is not too small and
the population from which it is drawn is not very different from
the normal are, therefore, given approximately as

$$\bar{y}_n - t_{(\alpha, n-1)}\sqrt{\frac{N-n}{Nn}}\; s \;\le\; \bar{y}_N \;\le\; \bar{y}_n + t_{(\alpha, n-1)}\sqrt{\frac{N-n}{Nn}}\; s \qquad (63)$$

For the six samples in Section 1.5 and for the same confidence
coefficients as given above, these confidence limits based on $s^2$ are
given in cols. 3 and 5 of Table 2.2. The values of $t_{(.32,\,1)}$ and
$t_{(.05,\,1)}$ have been interpolated from the $t$-table, being 1.85 and 12.7
respectively (Fisher and Yates, 1938).
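As a minimal sketch of (63), the function below computes the limits from a sample when the tabulated value of $t$ is supplied by the user. The function name and the two-unit sample are hypothetical; only the value 12.7 is taken from the text.

```python
from math import sqrt
from statistics import stdev

def t_confidence_limits(sample, N, t_value):
    """Limits ybar -/+ t_(alpha, n-1) * sqrt((N - n) / (N n)) * s, as in (63).

    t_value must be looked up in a t-table for n - 1 degrees of freedom.
    """
    n = len(sample)
    ybar = sum(sample) / n
    s = stdev(sample)                      # square root of the sample mean square
    half_width = t_value * sqrt((N - n) / (N * n)) * s
    return ybar - half_width, ybar + half_width

# Hypothetical sample of n = 2 from a population of N = 4, using the
# tabulated value 12.7 quoted in the text for alpha = .05 and 1 d.f.
lower, upper = t_confidence_limits([2.0, 3.0], N=4, t_value=12.7)
```

With only one degree of freedom the limits are very wide, which is why the entries based on $s$ at the 95% level are so much wider than those based on $S$.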


2a.13 Size of Sample for Specified Precision 


Almost the first question which a statistician is called upon to 
answer in planning a sample survey is about the size of the sample 
required for estimating the population value with a specified 
precision. The precision is usually specified in terms of the margin
of error permissible in the estimate and the coefficient of confidence
with which one wants to make sure that the estimate is within the
permissible margin of error. Thus, if the error permissible in the
estimate of the population value of the mean is, say, $\epsilon\bar{y}_N$, and
the degree of assurance desired is $1 - \alpha$, then clearly we need to
know the size of the sample so that

$$P\{|\bar{y}_n - \bar{y}_N| \ge \epsilon\bar{y}_N\} = \alpha \qquad (64)$$

Hence, from (61), we have

$$t_{(\alpha,\infty)}\sqrt{\frac{N-n}{Nn}}\; S = \epsilon\bar{y}_N$$

or

$$n = \frac{\dfrac{t^2_{(\alpha,\infty)}\, S^2}{\epsilon^2\bar{y}_N^2}}{1 + \dfrac{1}{N}\,\dfrac{t^2_{(\alpha,\infty)}\, S^2}{\epsilon^2\bar{y}_N^2}} \qquad (65)$$


The determination of the size of sample from (65) presumes 
the knowledge of the coefficient of variation for the population. 
This can only be roughly estimated. Consequently, (65) can give 
only a rough idea of the size of the sample required for estimating 
the population mean with a specified precision. We can, however, 
improve upon the predicted value of $n$ as follows:

Although the size of the sample is determined from (65), the
confidence limits after the survey is completed are obtained from
(63). In other words, $n$ could be more precisely evaluated from

$$n = \frac{\dfrac{t^2_{(\alpha, n-1)}\, S^2}{\epsilon^2\bar{y}_N^2}}{1 + \dfrac{1}{N}\,\dfrac{t^2_{(\alpha, n-1)}\, S^2}{\epsilon^2\bar{y}_N^2}} \qquad (66)$$

had $t_{(\alpha, n-1)}$ been known, which it is not, as it itself depends
upon $n$. As a result $n$ is underestimated since $t_{(\alpha,\infty)}$ is less than
$t_{(\alpha, n-1)}$. The obvious correction which suggests itself is to increase
the value of $n$ in the ratio $t^2_{(\alpha, n'-1)} / t^2_{(\alpha,\infty)}$, where $n'$ is evaluated
from (65), but the correction is not likely to be important unless
$n$ is small.
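The calculation in (65) is easy to sketch. The function name and the inputs are hypothetical; the coefficient of variation $S/\bar{y}_N$ is assumed to be guessed from earlier experience, as the text describes.

```python
from statistics import NormalDist

def sample_size_for_mean(cv, eps, N, alpha):
    """Sample size from (65); cv = S / ybar_N is the coefficient of variation."""
    t = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = (t * cv / eps) ** 2          # size required when N is very large
    return n0 / (1 + n0 / N)          # allowance for the finite population

# Hypothetical inputs: coefficient of variation 100%, 10% margin, N = 5000.
n = sample_size_for_mean(cv=1.0, eps=0.10, N=5000, alpha=0.05)
```

In practice the result would be rounded up, and possibly inflated by the ratio of squared $t$ values suggested in the text when $n$ turns out to be small.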


The calculation of $n$ from (65) also assumes knowledge of $S$
when the error, $\epsilon\bar{y}_N$, permissible in the estimate of the population
value of the mean is given, although the confidence limits after
the completion of the survey are calculated from (63) which makes
use of $s$.
An allowance for this inaccuracy can be made by making use of
the idea, originally due to Neyman (1934), of selecting a preliminary
sample for improving the sampling design of the survey (Sukhatme,
1935). Let $n_1$ be the size of the preliminary sample and $s_1^2$
denote the estimate of $S^2$ obtained therefrom. Then the additional
sample required for estimating the population value with the
desired accuracy, assuming $N$ to be large and $\epsilon\bar{y}_N$ to be given,
will be $n - n_1$, where

$$n = \frac{t^2_{(\alpha, n_1 - 1)}\, s_1^2}{\epsilon^2\bar{y}_N^2} \qquad (67)$$

It has been shown that $n$ so estimated satisfies the statement in
(64) and, on the average, gives a more accurate confidence interval
than when $S$ is unknown, but further discussion of the problem
is beyond the scope of this book.


2а.14 Hyper-Geometric Distribution—Two Classes 


We shall now consider the theory of simple random sampling as 
applied to qualitative characters. Consider, first, a situation in which 
the sampling units in the population are divided into two mutually 
exclusive classes, class 1 consisting of units possessing the attribute 
under consideration, and class 2 consisting of those not possessing it. 


Let 


p denote the proportion of sampling units in the population 
belonging to class 1, and 


q the proportion of units falling in class 2. 


Evidently, $Np$ will be the number of sampling units in the population
belonging to class 1, $Nq$ the number of sampling units in class 2,
and $Np + Nq = N$. Now, clearly, the probability $P(n_1)$ that in
a sample of $n$ selected out of $N$ by the method of simple random
sampling, $n_1$ will occur in class 1 and $n_2$ in class 2 will be given by

$$P(n_1) = \binom{n}{n_1}\left[\frac{Np}{N}\cdot\frac{Np-1}{N-1}\cdots\frac{Np-n_1+1}{N-n_1+1}\right]\left[\frac{Nq}{N-n_1}\cdot\frac{Nq-1}{N-n_1-1}\cdots\frac{Nq-n_2+1}{N-n+1}\right]$$

which can also be alternatively written as

$$P(n_1) = \frac{\dbinom{Np}{n_1}\dbinom{Nq}{n_2}}{\dbinom{N}{n}} \qquad (68)$$

The variate $n_1$ or the proportion $n_1/n$ is said to be distributed in
a hyper-geometric distribution. Since the possible values which $n_1$
can assume are $0, 1, \ldots, n$, we have

$$\sum_{n_1=0}^{n} P(n_1) = 1$$

or

$$\sum_{n_1=0}^{n}\binom{Np}{n_1}\binom{Nq}{n_2} = \binom{N}{n} \qquad (69)$$

As $N$ tends to be large, the distribution (68) approaches the
binomial, the probability of observing $n_1$ in class 1 and $n_2$ in class 2
in a sample of $n$ being now given by

$$\binom{n}{n_1} p^{n_1} q^{n_2}$$
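A short numerical check of (68) and (69), and of the approach to the binomial, can be sketched as follows. The function names and the population figures are hypothetical.

```python
from math import comb

def hypergeom_pmf(n1, N, Np, n):
    """P(n1) from (68): n1 units from class 1 (of size Np) in a sample of n."""
    return comb(Np, n1) * comb(N - Np, n - n1) / comb(N, n)

def binom_pmf(n1, n, p):
    """Binomial probability of n1 units in class 1 out of n."""
    return comb(n, n1) * p**n1 * (1 - p) ** (n - n1)

# Hypothetical population: N = 5000 units, 2000 of them in class 1 (p = 0.4).
N, Np, n = 5000, 2000, 10
total = sum(hypergeom_pmf(k, N, Np, n) for k in range(n + 1))   # identity (69)
gap = max(abs(hypergeom_pmf(k, N, Np, n) - binom_pmf(k, n, Np / N))
          for k in range(n + 1))
```

Here `total` checks that the probabilities sum to one, and `gap` is the largest difference between the hyper-geometric and binomial terms, which is small when $n$ is small relative to $N$.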
2a.15 Mean Value of the Hyper-Geometric Distribution

By definition,

$$E(n_1) = \sum_{n_1=0}^{n} n_1 P(n_1)$$

Substituting from (68) for $P(n_1)$, we have

$$E(n_1) = \sum_{n_1=0}^{n} n_1\,\frac{(Np)!}{n_1!\,(Np-n_1)!}\;\frac{(Nq)!}{n_2!\,(Nq-n_2)!}\;\frac{n!\,(N-n)!}{N!}$$

$$= \frac{Nnp}{N}\sum_{n_1=1}^{n}\frac{(Np-1)!}{(n_1-1)!\,(Np-n_1)!}\;\frac{(Nq)!}{n_2!\,(Nq-n_2)!}\;\frac{(n-1)!\,(N-n)!}{(N-1)!}$$

$$= np\sum_{n_1=1}^{n}\frac{\dbinom{Np-1}{n_1-1}\dbinom{Nq}{n_2}}{\dbinom{N-1}{n-1}}$$

Now

$$\frac{\dbinom{Np-1}{n_1-1}\dbinom{Nq}{n_2}}{\dbinom{N-1}{n-1}}$$

represents the probability that in a sample of $n - 1$, $n_1 - 1$ will
fall in class 1 and $n_2$ will fall in class 2. Consequently

$$\sum_{n_1=1}^{n}\frac{\dbinom{Np-1}{n_1-1}\dbinom{Nq}{n_2}}{\dbinom{N-1}{n-1}} = 1$$

Hence

$$E(n_1) = np \qquad (70)$$

Or, denoting by $p_n$ the proportion in the sample, we write

$$E(p_n) = p \qquad (71)$$

It follows that an unbiased estimate of the proportion $p$ in the
population is given by the proportion in the sample. In other
words,

$$\text{Est. } p = \frac{n_1}{n} = p_n \qquad (72)$$

Similarly,

$$\text{Est. } q = \frac{n_2}{n} = q_n \qquad (73)$$


2a.16 Variance of the Hyper-Geometric Distribution

By definition,

$$V(n_1) = E(n_1^2) - \{E(n_1)\}^2$$

Now

$$n_1^2 = n_1(n_1 - 1) + n_1 \qquad (74)$$

so that

$$V(n_1) = E\{n_1(n_1-1)\} + E(n_1) - \{E(n_1)\}^2 \qquad (75)$$

Also

$$E\{n_1(n_1-1)\} = \sum_{n_1=0}^{n} n_1(n_1-1)\,P(n_1)$$

$$= \sum_{n_1=0}^{n} n_1(n_1-1)\,\frac{(Np)!}{n_1!\,(Np-n_1)!}\;\frac{(Nq)!}{n_2!\,(Nq-n_2)!}\;\frac{n!\,(N-n)!}{N!}$$

$$= \frac{n(n-1)\,Np(Np-1)}{N(N-1)}\sum_{n_1=2}^{n}\frac{(Np-2)!}{(n_1-2)!\,(Np-n_1)!}\;\frac{(Nq)!}{n_2!\,(Nq-n_2)!}\;\frac{(n-2)!\,(N-n)!}{(N-2)!}$$

$$= \frac{n(n-1)\,Np(Np-1)}{N(N-1)} \qquad (76)$$

since the sum of the terms under the summation sign is evidently
1. Substituting the result in (75), we have

$$V(n_1) = \frac{n(n-1)\,Np(Np-1)}{N(N-1)} + np - n^2p^2$$

$$= \frac{N-n}{N-1}\; npq \qquad (77)$$

It follows that the sampling variance of the estimated proportion
is given by

$$V(p_n) = \frac{N-n}{N-1}\cdot\frac{pq}{n} \qquad (78)$$

and the standard error of the estimated proportion is given by

$$\text{S.E.}(p_n) = \sqrt{\frac{N-n}{N-1}\cdot\frac{pq}{n}} \qquad (79)$$
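Results (70) and (77) can be checked exactly for a small case by summing over the distribution (68). The sketch below uses exact rational arithmetic; the function name and the population figures are hypothetical.

```python
from fractions import Fraction
from math import comb

def hypergeom_moments(N, Np, n):
    """Exact mean and variance of n1 obtained by summing over (68)."""
    total = comb(N, n)
    pmf = {k: Fraction(comb(Np, k) * comb(N - Np, n - k), total)
           for k in range(n + 1)}
    mean = sum(k * pr for k, pr in pmf.items())
    var = sum(k * k * pr for k, pr in pmf.items()) - mean**2
    return mean, var

# Hypothetical population: N = 20 units, 8 of them in class 1, sample of 5.
N, Np, n = 20, 8, 5
p = Fraction(Np, N)
mean, var = hypergeom_moments(N, Np, n)
```

For these figures `mean` equals $np$ and `var` equals $\frac{N-n}{N-1}\,npq$ exactly, in agreement with (70) and (77).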


These results can also be obtained directly from those of the 
preceding sections. All that one need do is to adopt the convention 
of scoring the character of a sampling unit with one whenever
it appears in class 1 and zero when it falls in class 2. On
making these substitutions, we obtain

$$\bar{y}_N = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{Np}{N} = p \qquad (80)$$

$$\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{n_1}{n} = p_n \qquad (81)$$

$$S^2 = \frac{\sum_{i=1}^{N} y_i^2 - N\bar{y}_N^2}{N-1} = \frac{N}{N-1}\; p(1-p) \qquad (82)$$

$$s^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}_n^2}{n-1} = \frac{n}{n-1}\; p_n(1-p_n) \qquad (83)$$

On substituting the above in (16), we reach the result (71); and
on substitution in (39), we get the expression for the variance of
the sample proportion, as in (78).

Further, from (36), we get

$$E\left\{\frac{n}{n-1}\; p_n(1-p_n)\right\} = \frac{N}{N-1}\; pq \qquad (84)$$

It follows from (84) that an unbiased estimate of the product
$p(1 - p)$ is given by

$$\frac{n}{n-1}\cdot\frac{N-1}{N}\; p_n(1-p_n)$$

and not by

$$p_n(1 - p_n)$$

as one might suppose. We write

$$\text{Est. } p(1-p) = \frac{n}{n-1}\cdot\frac{N-1}{N}\; p_n(1-p_n) \qquad (85)$$

The unbiased estimate of the sampling variance of a proportion
in terms of the sample values is, therefore, given by

$$\text{Est. } V(p_n) = \frac{N-n}{N}\cdot\frac{p_n(1-p_n)}{n-1} \qquad (86)$$

and an estimate of the standard error of $p_n$ is given by

$$\text{Est. S.E.}(p_n) = \sqrt{\frac{N-n}{N}\cdot\frac{p_n(1-p_n)}{n-1}} \qquad (87)$$

This estimate, however, has a negative bias.
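The unbiasedness of (86) can be verified exactly for a small population by averaging the estimate over the hyper-geometric distribution. The sketch below is illustrative only; the function name and the figures are hypothetical.

```python
from fractions import Fraction
from math import comb

def est_var_proportion(n1, n, N):
    """Estimate (86): (N - n)/N * p_n (1 - p_n) / (n - 1)."""
    pn = Fraction(n1, n)
    return Fraction(N - n, N) * pn * (1 - pn) / (n - 1)

# Hypothetical population: N = 20 units, 8 in class 1, samples of n = 5.
N, Np, n = 20, 8, 5
p = Fraction(Np, N)
true_var = Fraction(N - n, N - 1) * p * (1 - p) / n          # (78)
expected_est = sum(
    Fraction(comb(Np, k) * comb(N - Np, n - k), comb(N, n))
    * est_var_proportion(k, n, N)
    for k in range(n + 1)
)
```

Here `expected_est` coincides with `true_var`. The square root in (87) does not share this property, which is the negative bias noted in the text.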


2a.17 Confidence Limits and Size of Sample for Specified 
Precision 


The confidence limits for the proportion are derived on the same
assumptions as for the quantitative characters, namely, that the
sample proportion $p_n$ is normally distributed. This will approximately
be so, unless $p$ is too small (or large) and $n$ is small. The
limits are given by

$$p = p_n \pm t_{(\alpha,\infty)}\sqrt{\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}} \qquad (88)$$

where $t_{(\alpha,\infty)}$ denotes, as before, the value of $t$ corresponding to
the significance level $\alpha$ and $\infty$ degrees of freedom. This can be
solved as a quadratic in $p$ (Bartlett, 1937) or, alternatively, as
a quicker approximation by substituting from (85) for $p(1 - p)$,
giving

$$p = p_n \pm t_{(\alpha,\infty)}\sqrt{\frac{N-n}{N}\cdot\frac{p_n(1-p_n)}{n-1}} \qquad (89)$$
As $p$ deviates from .5, the distribution of $p_n$ becomes remote
from the normal, and the normal theory ceases to be applicable
unless $n$ and $N$ are both large. The appropriate method in this
case is to obtain the confidence limits directly from the hyper-geometric
distribution itself. The probability of getting $n_1$ or
fewer numbers in class 1 in a sample of $n$ is the sum of the
corresponding terms of the hyper-geometric series. Equating this
sum to the chosen level of significance, it is possible to solve for $p$.
For any larger values of $p$, the probability would be smaller than
$\alpha$, and for smaller values greater than $\alpha$. The value of $p$ given by
this equation will, therefore, give an upper limit. Similarly,
by equating $\alpha$ to the sum of the terms of the hyper-geometric
series for variates greater than or equal to $n_1$, we obtain the lower
limit to $p$. The method, however, involves heavy computations,

The size of sample required for estimating the population proportion
with a specified precision is obtained from (88). If the
error permissible in the estimate is, say, $\epsilon p$ and the degree of
assurance desired is $1 - \alpha$, we then have

$$n = \frac{\dfrac{t^2_{(\alpha,\infty)}\, q}{\epsilon^2 p}}{1 + \dfrac{1}{N}\left\{\dfrac{t^2_{(\alpha,\infty)}\, q}{\epsilon^2 p} - 1\right\}} \qquad (90)$$

For $N$ large and $\epsilon$ not too small, $n$ is simply given by

$$n = \frac{t^2_{(\alpha,\infty)}\, q}{\epsilon^2 p} \qquad (91)$$


Example 2.1 


Material for the construction of 5000 wells was issued during 
the year 1944 in a certain district as part of the Grow-More-Food 
Campaign in India. The list of cultivators to whom it was issued, 
together with the proposed location of each well, is available. 
A large part of the material was reported to have been misused 
by diverting it to other purposes. It is proposed to assess the 
extent of the misuse by means of a sample spot check. In other 
words, it is proposed to estimate the proportion of wells actually 
constructed and used for irrigation purposes. The sample is 
proposed to be selected by the method of simple random sampling 
from the total population of wells for which the material was 
issued. The permissible margin of error in the estimated value
is 10% and the degree of assurance desired is 95%. Determine
the size of sample for values of $p$ ranging from .5 to .9.

We are given $N = 5000$, $\epsilon = .10$ and $t_{(\alpha,\infty)} = 1.960$.

Substituting in (90), we obtain for different values of $p$, the
following values of $n$.

p      .5     .6     .7     .8     .9
n     357    244    159     94     42

Since the worst critics do not place the misuse at more than
half of the material issued, a sample of 357 would appear to be
adequate for the proposed check.
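The figures of Example 2.1 can be reproduced from (90) with a few lines of Python. The function name is hypothetical, and the permissible error is taken to be $\epsilon p$, that is, 10% of the proportion being estimated.

```python
def sample_size_for_proportion(p, eps, N, t):
    """Sample size from (90) for a permissible error of eps * p."""
    n0 = t**2 * (1 - p) / (eps**2 * p)        # value given by (91)
    return n0 / (1 + (n0 - 1) / N)

# Figures of Example 2.1: N = 5000, eps = .10, t = 1.960.
proportions = (0.5, 0.6, 0.7, 0.8, 0.9)
sizes = [round(sample_size_for_proportion(p, 0.10, 5000, 1.960))
         for p in proportions]
```

Rounded to the nearest whole number, `sizes` agrees with the values of $n$ tabulated in the example.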
2a.18 Generalised Hyper-Geometric Distribution

We shall now extend the preceding results to the population
which is divided into $k$ mutually exclusive classes. Let $N_i$ denote
the number of units in the $i$-th class of the population ($i = 1, 2,
\ldots, k$), so that

$$\sum_{i=1}^{k} N_i = N$$

Now, if a simple random sample of $n$ is drawn out of this population
then it can be seen by analogy with the distribution of two
classes that the probability that $n_i$ units occur in the $i$-th class
and $n - n_i$ in all the other $k - 1$ classes together is given by

$$P\{n_i\} = \frac{\dbinom{N_i}{n_i}\dbinom{N-N_i}{n-n_i}}{\dbinom{N}{n}}$$

More generally, it can be seen that the probability that $n_1$ units
occur in class 1, $n_2$ in class 2, ..., and $n_k$ in class $k$, is given by

$$P\{n_1, n_2, \ldots, n_k\} = \frac{\dbinom{N_1}{n_1}\dbinom{N_2}{n_2}\cdots\dbinom{N_k}{n_k}}{\dbinom{N}{n}} \qquad (92)$$

It follows from (70) and (77) that

$$E(n_i) = np_i \qquad (93)$$

where

$$p_i = \frac{N_i}{N}$$

and

$$V(n_i) = \frac{N-n}{N-1}\; np_i(1-p_i) \qquad (94)$$

It should be pointed out that the numbers falling in any two
classes are not independent of each other, since

$$\sum_{i=1}^{k} n_i = n$$
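A small exact check of (92), (93) and (94) for three classes may help fix ideas. The function name and the class sizes are hypothetical, and the enumeration is only practical for small populations.

```python
from fractions import Fraction
from itertools import product
from math import comb

def multi_hypergeom_pmf(counts, sizes, n):
    """P{n_1, ..., n_k} from (92) for class sizes N_1, ..., N_k."""
    if sum(counts) != n:
        return Fraction(0)
    num = 1
    for Ni, ni in zip(sizes, counts):
        num *= comb(Ni, ni)              # comb returns 0 when ni exceeds Ni
    return Fraction(num, comb(sum(sizes), n))

# Hypothetical population with k = 3 classes of sizes 5, 3 and 2; sample of 4.
sizes, n = (5, 3, 2), 4
N = sum(sizes)
support = [c for c in product(range(n + 1), repeat=len(sizes)) if sum(c) == n]
total = sum(multi_hypergeom_pmf(c, sizes, n) for c in support)
mean_1 = sum(c[0] * multi_hypergeom_pmf(c, sizes, n) for c in support)
var_1 = sum(c[0] ** 2 * multi_hypergeom_pmf(c, sizes, n) for c in support) - mean_1**2
```

For the first class, `mean_1` and `var_1` agree with (93) and (94) with $p_1 = 5/10$.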


$$P\{n_{11}, n_{12}, n - n_1\} = \frac{\dbinom{N_{11}}{n_{11}}\dbinom{N_{12}}{n_{12}}\dbinom{N - N_{11} - N_{12}}{n - n_{11} - n_{12}}}{\dbinom{N}{n}} \qquad (102)$$

The probability that in a sample of $n$, $n_{11} + n_{12}$ (i.e., $n_1$) will
fall in classes 1 and 2 together and $n - n_{11} - n_{12}$ (i.e., $n - n_1$)
in classes 3 and 4 taken together is similarly given by

$$P\{n_{11} + n_{12},\; n - n_{11} - n_{12}\} = \frac{\dbinom{N_{11} + N_{12}}{n_{11} + n_{12}}\dbinom{N - N_{11} - N_{12}}{n - n_{11} - n_{12}}}{\dbinom{N}{n}} \qquad (103)$$

We know by a well-known theorem in probability that

$$P\{n_{11}, n_{12}, n - n_1\} = P\{n_1, n - n_1\}\; P\{n_{11}, n_{12} \mid n_1\} \qquad (104)$$

It follows that the probability required is given by the quotient
of (102) by (103) and is equal to

$$P\{n_{11}, n_{12} \mid n_1\} = \frac{\dbinom{N_{11}}{n_{11}}\dbinom{N_{12}}{n_{12}}}{\dbinom{N_1}{n_1}} \qquad (105)$$

In other words, $n_{11}/n_1$ follows a hyper-geometric distribution in
samples of $n_1$ drawn from $N_1$. Hence, from (101), we obtain

$$E\left(\frac{n_{11}}{n_1}\,\Big|\, n_1\right) = \frac{N_{11}}{N_1} = \frac{p_{11}}{p_1} \qquad (106)$$

where $p_{11} = N_{11}/N$, $p_1 = N_1/N$. The result is otherwise obvious
also, for, with $n_1$ fixed, the situation reduces to the two-class
problem and we apply the results of Section 2a.15. Substituting
the result in (100), we have

$$E\left(\frac{n_{11}}{n_1}\right) = \frac{N_{11}}{N_1} = \frac{p_{11}}{p_1} \qquad (107)$$



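To obtain the variance of $n_{11}/n_1$, we proceed similarly. By
definition,

$$V\left(\frac{n_{11}}{n_1}\right) = E\left(\frac{n_{11}^2}{n_1^2}\right) - \frac{N_{11}^2}{N_1^2}$$

$$= E\left[\frac{1}{n_1^2}\,E\left(n_{11}^2 \mid n_1\right)\right] - \frac{N_{11}^2}{N_1^2}$$

$$= E\left[\frac{1}{n_1^2}\,E\{n_{11}(n_{11}-1) \mid n_1\} + \frac{1}{n_1^2}\,E(n_{11} \mid n_1)\right] - \frac{N_{11}^2}{N_1^2} \qquad (108)$$

We have seen that $n_{11}/n_1$ follows a hyper-geometric distribution
in samples of $n_1$ drawn from $N_1$. It follows that

$$E(n_{11} \mid n_1) = n_1\,\frac{N_{11}}{N_1} = n_1\,\frac{p_{11}}{p_1} \qquad (109)$$

and from (76) we have

$$E\{n_{11}(n_{11}-1) \mid n_1\} = n_1(n_1-1)\,\frac{N_{11}(N_{11}-1)}{N_1(N_1-1)} \qquad (110)$$

Substituting in (108), we have

$$V\left(\frac{n_{11}}{n_1}\right) = E\left[\frac{n_1-1}{n_1}\cdot\frac{N_{11}(N_{11}-1)}{N_1(N_1-1)} + \frac{1}{n_1}\cdot\frac{N_{11}}{N_1}\right] - \frac{N_{11}^2}{N_1^2}$$

$$= \frac{N_{11}(N_{11}-1)}{N_1(N_1-1)} - \frac{N_{11}^2}{N_1^2} + \frac{N_{11}(N_1 - N_{11})}{N_1(N_1-1)}\; E\left(\frac{1}{n_1}\right)$$

$$= \frac{N_{11}(N_1 - N_{11})}{N_1(N_1-1)}\left\{E\left(\frac{1}{n_1}\right) - \frac{1}{N_1}\right\} \qquad (111)$$

Now let

$$n_1 = np_1 + \epsilon$$

where

$$E(\epsilon) = 0$$

and

$$V(\epsilon) = \frac{N-n}{N-1}\; n\,p_1(1-p_1)$$

Since $\epsilon$ will be small as compared with $np_1$ with a probability
approaching 1 as $n$ becomes large, we may write, neglecting terms
in $1/n^2$ and higher powers,

$$\frac{1}{n_1} = \frac{1}{np_1}\left(1 + \frac{\epsilon}{np_1}\right)^{-1} = \frac{1}{np_1}\left(1 - \frac{\epsilon}{np_1} + \frac{\epsilon^2}{n^2p_1^2} - \cdots\right)$$

so that

$$E\left(\frac{1}{n_1}\right) = \frac{1}{np_1}$$

to a first approximation, or more precisely

$$E\left(\frac{1}{n_1}\right) = \frac{1}{np_1} + \frac{N-n}{N-1}\cdot\frac{1-p_1}{n^2p_1^2} \qquad (112)$$

Substituting from (112) in (111), we have

$$V\left(\frac{n_{11}}{n_1}\right) = \frac{N_{11}(N_1 - N_{11})}{N_1(N_1-1)}\left\{\frac{1}{np_1} - \frac{1}{Np_1} + \frac{N-n}{N-1}\cdot\frac{1-p_1}{n^2p_1^2}\right\} \qquad (113)$$

Replacing $N_1 - 1$, $N - 1$ by $N_1$, $N$ respectively, we get

$$V_1\left(\frac{n_{11}}{n_1}\right) = \frac{N-n}{N}\cdot\frac{1}{np_1}\cdot\frac{p_{11}}{p_1}\left(1 - \frac{p_{11}}{p_1}\right) \qquad (114)$$

where $V_1$ denotes the first approximation to the variance, and

$$V_2\left(\frac{n_{11}}{n_1}\right) = \frac{N-n}{N}\cdot\frac{1}{np_1}\cdot\frac{p_{11}}{p_1}\left(1 - \frac{p_{11}}{p_1}\right)\left\{1 + \frac{1-p_1}{np_1}\right\} \qquad (115)$$

where $V_2$ denotes the second approximation to the variance.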


2a.20 Quantitative and Qualitative Characters 


We will now extend the preceding theory to the situation involv- 
ing both quantitative and qualitative variation together in the same 
problem. This situation is of common occurrence. Thus, in a 
population survey we may be required to estimate both the 
proportions of families in different income groups as also the 
total income in each group. The tabulation on a sampling basis 
of census results presents similar problems. Suppose punched 
cards, each one representing the data of different holdings in an
agricultural census, are available for sorting and tabulation. Further,
suppose that the holdings are to be classified according to their
size in five classes: 0-2.5, 2.5-5.0, 5-10, 10-25 and larger than
25 acres. We may be required to estimate proportions of holdings
in the different classes and also the total area under any specified
crop in each class. In all such problems, it is convenient to
select a sample of $n$ out of the total of $N$ by the method of
simple random sampling. We have already considered the problem
of estimating the proportions in the different classes. The problems
for consideration now are:


(a) to obtain an estimate of the total (or the average) of the 
quantitative character under study in each class; 


(b) to obtain the standard error of the estimates in (a); and 


(c) to predict the size of the sample $n$ required for estimating
the total in each class with a given standard error. 


Without loss of generality, we may consider these problems in 
relation to an actual example. 


Example 2.2 
It is proposed to estimate the area benefited from irrigation 
wells said to have been completed under the Grow-More-Food 
Campaign from the data given in Example 2.1. The sample is 
proposed to be selected by the method of simple random sampling 
from the population of wells reported to have been constructed. 
The number of wells actually constructed is not known. How 
large should n be in order that the area benefited may be 
determined with 5% standard error?
Let
N = Number of wells reported to have been constructed
under the Grow-More-Food Campaign;
p = Proportion of wells in the population actually constructed;
for convenience we will designate these wells
as belonging to class 1; and

q = Proportion of wells not completed. We will designate
the wells under this category as falling in class 2.

Evidently,

$$Np + Nq = N$$


Let, further, $\bar{y}_{N_1}$, $S_1^2$ be respectively the population mean area
benefited per well and the population mean square for class 1.

Let $n_1$ denote the number of wells falling in class 1 when
a random sample of $n$ is chosen by the method of simple random
sampling from the population $N$, and $\bar{y}_{n_1}$ be the corresponding
mean area in the sample. Our first problem is to obtain the
estimate of the total area benefited, namely, $Np\,\bar{y}_{N_1}$.

Since the sample is chosen by the method of simple random
sampling from the entire population, the sub-sample $n_1$ can also
be considered a random sample from the corresponding population
of $Np$ units. It follows that for a given $n_1$, $\bar{y}_{n_1}$ will be an unbiased
estimate of $\bar{y}_{N_1}$.

It is, therefore, natural to take $N\cdot\frac{n_1}{n}\cdot\bar{y}_{n_1}$ as the estimate of the
total area benefited from the completed wells. It is easy to show
that this is an unbiased estimate of $Np\,\bar{y}_{N_1}$. For

$$E\left(N\,\frac{n_1}{n}\,\bar{y}_{n_1}\right) = \frac{N}{n}\,E\left\{n_1\,E(\bar{y}_{n_1} \mid n_1)\right\} = \frac{N}{n}\,E(n_1)\,\bar{y}_{N_1} = Np\,\bar{y}_{N_1} \qquad (116)$$

The next problem is to obtain the sampling variance of $N\cdot\frac{n_1}{n}\cdot\bar{y}_{n_1}$.
We have

$$V\left(N\,\frac{n_1}{n}\,\bar{y}_{n_1}\right) = E\left\{\left(N\,\frac{n_1}{n}\,\bar{y}_{n_1}\right)^2\right\} - N^2p^2\bar{y}_{N_1}^2$$

$$= \frac{N^2}{n^2}\,E\left[E\left\{(n_1\bar{y}_{n_1})^2 \mid n_1\right\}\right] - N^2p^2\bar{y}_{N_1}^2 \qquad (117)$$
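Now, for given $n_1$,

$$E\left\{(n_1\bar{y}_{n_1})^2 \mid n_1\right\} = E\left\{\sum_{i=1}^{n_1} y_i^2 \,\Big|\, n_1\right\} + E\left\{\sum_{i \ne j=1}^{n_1} y_iy_j \,\Big|\, n_1\right\}$$

$$= \frac{n_1}{Np}\sum_{i=1}^{Np} y_i^2 + \frac{n_1(n_1-1)}{Np(Np-1)}\sum_{i \ne j=1}^{Np} y_iy_j \qquad (118)$$

Hence

$$E\left[E\left\{(n_1\bar{y}_{n_1})^2 \mid n_1\right\}\right] = \frac{1}{Np}\sum_{i=1}^{Np} y_i^2\; E(n_1) + \frac{1}{Np(Np-1)}\sum_{i \ne j=1}^{Np} y_iy_j\; E\{n_1(n_1-1)\} \qquad (119)$$

Using the results, already established, namely,

$$E(n_1) = np$$

and

$$E\{n_1(n_1-1)\} = \frac{Np(Np-1)}{N(N-1)}\; n(n-1)$$

we obtain from (119)

$$E\left[E\left\{(n_1\bar{y}_{n_1})^2 \mid n_1\right\}\right] = \frac{n}{N}\sum_{i=1}^{Np} y_i^2 + \frac{n(n-1)}{N(N-1)}\sum_{i \ne j=1}^{Np} y_iy_j$$

$$= \frac{n}{N}\sum_{i=1}^{Np} y_i^2 + \frac{n(n-1)}{N(N-1)}\left\{\left(\sum_{i=1}^{Np} y_i\right)^2 - \sum_{i=1}^{Np} y_i^2\right\}$$

$$= \frac{n(N-n)}{N(N-1)}\sum_{i=1}^{Np} y_i^2 + \frac{n(n-1)}{N(N-1)}\; N^2p^2\bar{y}_{N_1}^2$$

$$= \frac{n(N-n)}{N(N-1)}\left\{(Np-1)S_1^2 + Np\,\bar{y}_{N_1}^2\right\} + \frac{n(n-1)}{N(N-1)}\; N^2p^2\bar{y}_{N_1}^2 \qquad (120)$$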
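Substituting the result in (117), we have

$$V\left(N\,\frac{n_1}{n}\,\bar{y}_{n_1}\right) = \frac{N(N-n)}{n(N-1)}\left\{(Np-1)S_1^2 + Np\,\bar{y}_{N_1}^2\right\} + \frac{N(n-1)}{n(N-1)}\; N^2p^2\bar{y}_{N_1}^2 - N^2p^2\bar{y}_{N_1}^2$$

$$= \frac{N(N-n)}{n(N-1)}\left\{(Np-1)S_1^2 + Np\,\bar{y}_{N_1}^2\right\} - \frac{N-n}{n(N-1)}\; N^2p^2\bar{y}_{N_1}^2$$

$$= \frac{N(N-n)}{n(N-1)}\,(Np-1)\,S_1^2 + \frac{N^2(N-n)}{n(N-1)}\; p(1-p)\,\bar{y}_{N_1}^2 \qquad (121)$$

An alternative and instructive way of deriving the above result
is to proceed as follows. We have

$$n\bar{y}_n = \sum_{i=1}^{n} y_i = \sum_{i=1}^{n_1} y_i$$

since $n - n_1$ of the $y$ values are zero each,

$$= n_1\bar{y}_{n_1}$$

It follows, therefore, that

$$V\left(N\,\frac{n_1}{n}\,\bar{y}_{n_1}\right) = V(N\bar{y}_n)$$

$$= \frac{N(N-n)}{n(N-1)}\left\{(Np-1)S_1^2 + Np(1-p)\,\bar{y}_{N_1}^2\right\}$$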
For purposes of simplification, we will assume that $Np$ is large
enough to permit the following approximations:

$$\frac{Np-1}{Np} \approx 1 \quad\text{and}\quad \frac{N-1}{N} \approx 1$$

Using these, we obtain

$$V\left(N\,\frac{n_1}{n}\,\bar{y}_{n_1}\right) = \frac{N(N-n)}{n}\left\{pS_1^2 + p(1-p)\,\bar{y}_{N_1}^2\right\} \qquad (122)$$

To predict the size of sample required for estimating the
character with a given standard error, we need the expression
for the relative variance. This is obtained by dividing (122) by
$N^2p^2\bar{y}_{N_1}^2$, and is given by

$$\frac{N-n}{N}\cdot\frac{1}{n}\left\{\frac{C_1^2}{p} + \frac{1-p}{p}\right\} \qquad (123)$$

where $C_1^2$ denotes the square of the coefficient of variation of the
area irrigated from a well in class 1. For $N$ large, the relative
variance is given by

$$\frac{1}{n}\left\{\frac{C_1^2}{p} + \frac{1-p}{p}\right\} \qquad (124)$$

An idea of $C_1^2$ may be formed from previous experience. Let us
assume it to be 0.5. Since $n$ will need to be large as $p$ decreases,
$p$ may be assumed to be the smallest of the values consistent with
expectation and previous experience in order that we may err on
the safe side. Table 2.3 gives values of $n$ for different values
of $p$ and for $C_1^2 = 0.5$ in order that the area benefited may be
estimated with 5% standard error.
Two sets of values of $n$ are given:
(i) those obtained from (123), for $N = 5000$, and
(ii) those from (124), i.e., after neglecting the finite multiplier.
It will be seen that a sample of 690 wells will be required for
estimating the area benefited with a degree of accuracy as large
as the one specified or larger, assuming of course that $p$ does
not fall below .5. Ignoring the finite multiplier altogether would
imply a loss of nearly 15% of the information.


TABLE 2.3

Sample Size Required for Estimating the Total in Class 1
with 5% Standard Error

p            .5     .6     .7     .8     .9
(i)  n      690    536    419    327    253
(ii) n      800    600    457    350    267
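The entries of Table 2.3 follow from setting the relative variance in (123) or (124) equal to the square of the desired relative standard error and solving for $n$. A minimal sketch, with a hypothetical function name:

```python
def size_for_class_total(p, c1_sq, rel_se, N=None):
    """Sample size giving relative standard error rel_se for the class total.

    Uses (124) when N is None (finite multiplier neglected), otherwise (123).
    """
    n0 = (c1_sq + 1 - p) / (p * rel_se**2)    # from (124)
    if N is None:
        return n0
    return n0 / (1 + n0 / N)                  # from (123)

proportions = (0.5, 0.6, 0.7, 0.8, 0.9)
row_i = [round(size_for_class_total(p, 0.5, 0.05, N=5000)) for p in proportions]
row_ii = [round(size_for_class_total(p, 0.5, 0.05)) for p in proportions]
```

Rounded to whole numbers, `row_i` and `row_ii` reproduce rows (i) and (ii) of the table.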


We may call attention to one important point. It will be seen 
from (122) that sampling units falling in a given class alone 
contribute to the information in that class. It follows that this 
formula for the sampling variance is applicable to any class, even 
when the population consists of several classes, p in that case 
representing the proportion of units in the population falling in 
the given class, and (1 — p) representing the proportion of units 
falling in all the remaining classes together. It also follows that 
the value of $n$ required for estimating the class areas with a specified
accuracy or higher is the value corresponding to the smallest of 
the p values. 


B. SAMPLING WITH VARYING PROBABILITIES 
OF SELECTION 


2b.1 Introduction 


The theory considered in the previous sections is appropriate 
for the method of simple random sampling in which the selection 
probabilities are equal for all units in the population. Although 
this method is by far the most common method of sampling,
unequal probabilities of selection are sometimes used, and give 
more efficient estimates in the sense of giving population estimates 
with smaller standard errors. In the following sections we shall 
give the basic theory of this method. 


There is one difference between the method of simple random 
sampling and that of sampling with varying probabilities of 
selection. In the former, the probability of drawing a specified
unit remains constant at each draw; in the latter, it does not.
Let, as before, $P_i$ denote the probability of selecting the $i$-th unit
of the population at the first draw ($i = 1, 2, \ldots, N$), so that
$\sum_{i=1}^{N} P_i = 1$, and $P_{ir}$ denote the probability of drawing $y_i$ at the
$r$-th draw ($r = 1, 2, \ldots, n$).

Clearly,

$$P_{i1} = P_i \quad (i = 1, 2, \ldots, N) \qquad (125)$$

and

$$P_{i2} = (\text{Probability that } y_i \text{ is not drawn at the first draw}) \times (\text{Probability that } y_i \text{ is drawn at the second draw})$$

$$= \sum_{j(\ne i)=1}^{N} (\text{Probability that } y_j \text{ is drawn at the first draw}) \times (\text{Probability that } y_i \text{ is drawn at the second draw})$$

$$= \left\{\frac{P_1}{1-P_1} + \frac{P_2}{1-P_2} + \cdots + \frac{P_{i-1}}{1-P_{i-1}} + \frac{P_{i+1}}{1-P_{i+1}} + \cdots + \frac{P_N}{1-P_N}\right\} P_i$$

$$= \left\{\sum_{j=1}^{N}\frac{P_j}{1-P_j} - \frac{P_i}{1-P_i}\right\} P_i$$

$$= \left\{S - \frac{P_i}{1-P_i}\right\} P_i \qquad (126)$$

where

$$S = \sum_{j=1}^{N}\frac{P_j}{1-P_j} \qquad (127)$$

It is thus seen that $P_{i2}$ is not equal to $P_{i1}$, unless $P_i = 1/N$. The
theory of sampling with varying probabilities of selection is
consequently more complex than that of simple random sampling.
One way of introducing simplification into the theory is to replace
a selected unit before another draw is made, so that $P_{ir} = P_i$
for all $r$. We shall first present this simplified theory appropriate
to the procedure of sampling with replacement and then consider 
briefly the theory appropriate for sampling without replacement. 
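The change in selection probability between the first and second draws, as given by (126) and (127), can be illustrated numerically. The function name and the first-draw probabilities are hypothetical; exact fractions are used so that the comparison is free of rounding.

```python
from fractions import Fraction

def second_draw_probabilities(P):
    """P_i2 from (126): {S - P_i/(1 - P_i)} P_i, with S as in (127)."""
    S = sum(p / (1 - p) for p in P)
    return [(S - p / (1 - p)) * p for p in P]

# Hypothetical first-draw probabilities for N = 4 units.
P = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
P2 = second_draw_probabilities(P)

# With equal probabilities 1/N the second-draw probabilities are unchanged.
equal = second_draw_probabilities([Fraction(1, 4)] * 4)
```

The second-draw probabilities still add up to one, but differ from the first-draw probabilities unless all $P_i$ equal $1/N$.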


2b.2 Sampling with Replacement: Sample Estimate and its 
Variance 


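Define a variate $z$ given by

$$z_i = \frac{y_i}{NP_i} \qquad (128)$$

and consider the simple arithmetic mean of $z$ values in the sample
given by

$$\bar{z}_n = \frac{1}{n}\sum_{i=1}^{n} z_i \qquad (129)$$

It is easily shown that $\bar{z}_n$ provides an unbiased estimate of the
population mean $\bar{y}_N$.

For, in sampling with replacement,

$$E(z_i) = \sum_{i=1}^{N} P_i z_i \qquad (130)$$

which we may denote by $\bar{z}_{.}$, at each draw.

Substituting for $z_i$ from (128) on the right-hand side of (130),
we get

$$E(z_i) = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}_N \qquad (131)$$

It follows that

$$E(\bar{z}_n) = \frac{1}{n}\sum_{i=1}^{n} E(z_i) = \bar{y}_N \qquad (132)$$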
To obtain the sampling variance of $\bar{z}_n$, we have

$$V(\bar{z}_n) = E\{\bar{z}_n - E(\bar{z}_n)\}^2 = E(\bar{z}_n^2) - \{E(\bar{z}_n)\}^2$$

$$= \frac{1}{n^2}\left\{\sum_{i=1}^{n} E(z_i^2) + \sum_{i \ne j=1}^{n} E(z_iz_j)\right\} - \bar{z}_{.}^2 \qquad (133)$$

Now

$$E(z_i^2) = \sum_{i=1}^{N} P_i z_i^2 \qquad (134)$$

and

$$E(z_iz_j) = E(z_i)\cdot E(z_j)$$

since draws are made with replacement.

On substituting for $E(z_i)$ and $E(z_j)$ from (130), we have

$$E(z_iz_j) = \left(\sum_{i=1}^{N} P_iz_i\right)\left(\sum_{j=1}^{N} P_jz_j\right) = \bar{z}_{.}^2 \qquad (135)$$

Substituting from (134) and (135) in (133), we obtain

$$V(\bar{z}_n) = \frac{1}{n^2}\left\{n\sum_{i=1}^{N} P_iz_i^2 + n(n-1)\,\bar{z}_{.}^2\right\} - \bar{z}_{.}^2 = \frac{\sigma_z^2}{n} \qquad (136)$$

where

$$\sigma_z^2 = \sum_{i=1}^{N} P_i\,(z_i - \bar{z}_{.})^2 \qquad (137)$$

and represents the variance of a single $z$.

It will be noticed that the finite multiplier does not enter into
the expression for the variance of the estimate when sampling is
carried out with replacement. When $P_i = 1/N$, we have

$$\bar{z}_{.} = \bar{y}_N$$

and

$$\sigma_z^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y}_N)^2 = \sigma^2$$

and $V(\bar{y}_n)$ is given by the familiar expression

$$V(\bar{y}_n) = \frac{\sigma^2}{n} = \frac{N-1}{N}\cdot\frac{S^2}{n} \qquad (138)$$

Lastly, we remark that when the selection probability is proportional
to the value of the variate, in other words, when $P_i$ is
proportional to $y_i$, say, $P_i = y_i/(N\bar{y}_N)$, $z$ assumes a constant value for all
$i$, and in consequence (136) reduces to zero. In practice the values
of the variate will, of course, not be known in advance but values
of another variate correlated with the variate under study may be
known. We may, therefore, expect that when $P_i$ is proportional
to the measure of size of $y_i$, the estimate may be considerably
more efficient than that based on simple random sampling.
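The effect of the choice of $P_i$ on the variance (136) can be seen in a small numerical sketch. The population values and the function name are hypothetical; the second set of probabilities is taken exactly proportional to $y_i$, which is possible here only because the $y_i$ are assumed known.

```python
from fractions import Fraction

def pps_mean_and_variance(y, P, n):
    """Expected value and variance (136) of z-bar for n draws with replacement."""
    N = len(y)
    z = [yi / (N * Pi) for yi, Pi in zip(y, P)]                        # (128)
    z_dot = sum(Pi * zi for Pi, zi in zip(P, z))                       # (130)
    sigma_z_sq = sum(Pi * (zi - z_dot) ** 2 for Pi, zi in zip(P, z))   # (137)
    return z_dot, sigma_z_sq / n

# Hypothetical population of N = 4 units.
y = [Fraction(v) for v in (2, 3, 4, 7)]
equal_P = [Fraction(1, 4)] * 4
size_P = [yi / sum(y) for yi in y]        # probabilities proportional to y
mean_eq, var_eq = pps_mean_and_variance(y, equal_P, n=2)
mean_pps, var_pps = pps_mean_and_variance(y, size_P, n=2)
```

Both schemes give an expected value equal to the population mean, but the variance drops from $\sigma^2/n$ under equal probabilities to zero when $P_i$ is exactly proportional to $y_i$.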


2b.3 Sampling with Replacement: Estimation of the Sampling 
Variance 


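Consider the mean square of $z_i$'s obtained from the sample,
defined by

$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}(z_i - \bar{z}_n)^2 \qquad (139)$$

Expanding, and taking expectations, we get

$$E(s_z^2) = \frac{1}{n-1}\,E\left\{\sum_{i=1}^{n} z_i^2 - n\bar{z}_n^2\right\}$$

$$= \frac{1}{n-1}\left\{\sum_{i=1}^{n} E(z_i^2) - nE(\bar{z}_n^2)\right\} \qquad (140)$$

Now, by definition,

$$V(\bar{z}_n) = E(\bar{z}_n^2) - \bar{z}_{.}^2$$

so that

$$E(\bar{z}_n^2) = V(\bar{z}_n) + \bar{z}_{.}^2 = \frac{\sigma_z^2}{n} + \bar{z}_{.}^2 \qquad (141)$$

Substituting from (134) and (141) in (140), we get

$$E(s_z^2) = \frac{1}{n-1}\left\{n\sum_{i=1}^{N} P_iz_i^2 - \sigma_z^2 - n\bar{z}_{.}^2\right\} = \sigma_z^2 \qquad (142)$$

showing that $s_z^2$ is an unbiased estimate of $\sigma_z^2$. It follows
that

$$\text{Est. } V(\bar{z}_n) = \frac{s_z^2}{n} \qquad (143)$$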
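The unbiasedness of (143) can be checked exactly for a small population by averaging the estimate over every ordered sample that sampling with replacement can produce. The sketch below is illustrative; the population values, probabilities and function name are hypothetical.

```python
from fractions import Fraction
from itertools import product

def expected_variance_estimate(y, P, n):
    """Average of s_z^2 / n, as in (143), over all ordered samples
    of n draws made with replacement."""
    N = len(y)
    z = [yi / (N * Pi) for yi, Pi in zip(y, P)]
    expected = Fraction(0)
    for draws in product(range(N), repeat=n):
        prob = Fraction(1)
        for i in draws:
            prob *= P[i]                     # draws are independent
        zs = [z[i] for i in draws]
        z_bar = sum(zs) / n
        s_sq = sum((v - z_bar) ** 2 for v in zs) / (n - 1)   # (139)
        expected += prob * s_sq / n
    return expected

y = [Fraction(v) for v in (2, 3, 4, 7)]
P = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
est = expected_variance_estimate(y, P, n=2)
```

The value `est` equals $\sigma_z^2/n$ computed directly from (137), which is the statement of (142) and (143).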


2b.4 Sampling without Replacement

We have already noted that in sampling without replacement,
when the probabilities are unequal, the probability of drawing
a specified unit of the population at a given draw is not in
general equal to the probability of selecting it at the first draw.
It follows that the expected value of a variate will change with
the successive draws. In this section we shall consider the theory
of sampling without replacement for the simple case when the
sample consists of only two draws.

One device of overcoming the difficulty of changing expectation
with the successive draws is to define $z$ differently for the successive
draws. Thus, let

$$z_i' = \frac{y_i}{NP_i} \qquad (144)$$

and

$$z_j'' = \frac{y_j}{N\left(S - \dfrac{P_j}{1-P_j}\right)P_j} \qquad (145)$$

where $y_i$ and $y_j$ are values of the units drawn at the first and
second draws respectively.

Clearly,

$$E(z_i') = \sum_{i=1}^{N} P_{i1}\,\frac{y_i}{NP_i} = \bar{y}_N \qquad (146)$$

Also,

$$E(z_j'') = \sum_{j=1}^{N} P_{j2}\,\frac{y_j}{N\left(S - \dfrac{P_j}{1-P_j}\right)P_j} = \bar{y}_N \qquad (147)$$

It follows that a simple arithmetic mean of $z_i'$ and $z_j''$ will provide
us with an unbiased estimate of the population mean.


An alternative estimate, in which $z$ is defined independently of
the order of the draw and which is unbiased, can also be formed.

Let

$$z_i = \frac{2y_i}{N\left(S + 1 - \dfrac{P_i}{1-P_i}\right)P_i} \qquad (148)$$

and consider the estimate

$$\bar{z}_{(n=2)} = \frac{1}{2}\sum_{i=1}^{2} z_i \qquad (149)$$

Clearly,

$$E(\bar{z}_{(n=2)}) = \frac{1}{2}\sum_{i=1}^{N} E(a_i)\,z_i \qquad (150)$$

where $E(a_i)$ stands for the probability of including $y_i$ in a sample
of two, and is given by

$E(a_i)$ = Probability that $y_i$ is drawn at the first draw
+ Probability that $y_i$ is drawn at the second draw

Using (125) and (126), we have

$$E(a_i) = \left\{S + 1 - \frac{P_i}{1-P_i}\right\} P_i \qquad (151)$$

Substituting for $E(a_i)$ from (151) in (150), we get

$$E(\bar{z}_{(n=2)}) = \frac{1}{2}\sum_{i=1}^{N}\left\{S + 1 - \frac{P_i}{1-P_i}\right\} P_i z_i = \bar{z}_{.}, \text{ say} \qquad (152)$$

Substituting for $z_i$ from (148) in (152), we have

$$E(\bar{z}_{(n=2)}) = \frac{1}{2}\sum_{i=1}^{N}\frac{2y_i}{N\left(S + 1 - \dfrac{P_i}{1-P_i}\right)P_i}\times\left(S + 1 - \frac{P_i}{1-P_i}\right)P_i = \bar{y}_N \qquad (153)$$

thus showing that $\bar{z}_{(n=2)}$ provides an unbiased estimate of the
population mean $\bar{y}_N$.
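Both unbiased estimates for two draws without replacement can be verified exactly on a small population by enumerating every ordered pair of units. This is a sketch under hypothetical values of $y_i$ and $P_i$; the function name is also an assumption.

```python
from fractions import Fraction
from itertools import permutations

def expected_estimates(y, P):
    """Exact expectations, over all ordered pairs drawn without replacement,
    of the mean of z' and z'' ((144), (145)) and of the estimate (149)."""
    N = len(y)
    S = sum(p / (1 - p) for p in P)                       # (127)
    second = [S - p / (1 - p) for p in P]                 # P_i2 / P_i, from (126)
    incl = [(s + 1) * p for s, p in zip(second, P)]       # E(a_i), from (151)
    exp_ordered = Fraction(0)
    exp_symmetric = Fraction(0)
    for i, j in permutations(range(N), 2):
        prob = P[i] * P[j] / (1 - P[i])                   # unit i first, then unit j
        z1 = y[i] / (N * P[i])                            # (144)
        z2 = y[j] / (N * second[j] * P[j])                # (145)
        exp_ordered += prob * (z1 + z2) / 2
        zi = 2 * y[i] / (N * incl[i])                     # (148)
        zj = 2 * y[j] / (N * incl[j])
        exp_symmetric += prob * (zi + zj) / 2             # (149)
    return exp_ordered, exp_symmetric

y = [2, 3, 4, 7]
P = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
exp_ordered, exp_symmetric = expected_estimates(y, P)
```

Both expectations equal the population mean of the hypothetical values, here 4.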
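We shall next obtain the variance of $\bar{z}_{(n=2)}$. We write

$$V(\bar{z}_{(n=2)}) = E(\bar{z}_{(n=2)}^2) - \bar{y}_N^2$$

$$= \frac{1}{4}\left\{\sum_{i=1}^{N} E(a_i)\,z_i^2 + \sum_{i \ne j=1}^{N} E(a_ia_j)\,z_iz_j\right\} - \bar{y}_N^2 \qquad (154)$$

where $E(a_ia_j)$ stands for the probability of including $y_i$ and $y_j$ in
a sample of two and is given by

$$E(a_ia_j) = P_i\,\frac{P_j}{1-P_i} + P_j\,\frac{P_i}{1-P_j} = P_iP_j\left\{\frac{1}{1-P_i} + \frac{1}{1-P_j}\right\} \qquad (155)$$

On substituting for $E(a_i)$ from (151) and for $E(a_ia_j)$ from (155)
in (154), we have

$$V(\bar{z}_{(n=2)}) = \frac{1}{4}\left[\sum_{i=1}^{N}\left\{S + 1 - \frac{P_i}{1-P_i}\right\}P_iz_i^2 + \sum_{i \ne j=1}^{N}\left\{\frac{1}{1-P_i} + \frac{1}{1-P_j}\right\}P_iP_jz_iz_j\right] - \bar{y}_N^2 \qquad (156)$$

To obtain the estimate of the variance, we have from the
definition of the variance

$$\text{Est. } V(\bar{z}_{(n=2)}) = \bar{z}_{(n=2)}^2 - \text{Est. } \bar{z}_{.}^2 \qquad (157)$$

Now, from (152),

$$\bar{z}_{.}^2 = \frac{1}{4}\left[\sum_{i=1}^{N}\left(S + 1 - \frac{P_i}{1-P_i}\right)^2P_i^2z_i^2 + \sum_{i \ne j=1}^{N}\left(S + 1 - \frac{P_i}{1-P_i}\right)\left(S + 1 - \frac{P_j}{1-P_j}\right)P_iP_jz_iz_j\right] \qquad (158)$$

It is easily shown that

$$E\left\{\sum_{i=1}^{2}\left(S + 1 - \frac{P_i}{1-P_i}\right)P_iz_i^2\right\} = \sum_{i=1}^{N} E(a_i)\left(S + 1 - \frac{P_i}{1-P_i}\right)P_iz_i^2$$

$$= \sum_{i=1}^{N}\left(S + 1 - \frac{P_i}{1-P_i}\right)^2P_i^2z_i^2 \qquad (159)$$

and

$$E\left\{\sum_{i \ne j=1}^{2}\frac{\left(S + 1 - \dfrac{P_i}{1-P_i}\right)\left(S + 1 - \dfrac{P_j}{1-P_j}\right)}{\dfrac{1}{1-P_i} + \dfrac{1}{1-P_j}}\;z_iz_j\right\} = \sum_{i \ne j=1}^{N} E(a_ia_j)\,\frac{\left(S + 1 - \dfrac{P_i}{1-P_i}\right)\left(S + 1 - \dfrac{P_j}{1-P_j}\right)}{\dfrac{1}{1-P_i} + \dfrac{1}{1-P_j}}\;z_iz_j$$

$$= \sum_{i \ne j=1}^{N}\left(S + 1 - \frac{P_i}{1-P_i}\right)\left(S + 1 - \frac{P_j}{1-P_j}\right)P_iP_jz_iz_j \qquad (160)$$

It follows from (157), (158), (159) and (160) that

$$\text{Est. } V(\bar{z}_{(n=2)}) = \bar{z}_{(n=2)}^2 - \frac{1}{4}\left[\sum_{i=1}^{2}\left(S + 1 - \frac{P_i}{1-P_i}\right)P_iz_i^2 + \sum_{i \ne j=1}^{2}\frac{\left(S + 1 - \dfrac{P_i}{1-P_i}\right)\left(S + 1 - \dfrac{P_j}{1-P_j}\right)}{\dfrac{1}{1-P_i} + \dfrac{1}{1-P_j}}\;z_iz_j\right] \qquad (161)$$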

2b.5 Sampling without Replacement—General Case 


The formal general expression for an unbiased estimate of $\bar{y}_N$,
for any sample size, should now be obvious.
Let

$$z_i = \frac{n\,y_i}{N\,E(a_i)} \tag{162}$$

Then the simple arithmetic mean of $z$'s provides an unbiased
estimate of $\bar{y}_N$, for,

$$E(\bar{z}_n) = E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{n\,y_i}{N\,E(a_i)}\right] = \frac{1}{n}\sum_{i=1}^{N} E(a_i)\,\frac{n\,y_i}{N\,E(a_i)} = \bar{y}_N \tag{163}$$


To obtain the sampling variance, we write

$$V(\bar{z}_n) = E\left(\bar{z}_n^{\,2}\right) - \bar{y}_N^{\,2} = \frac{1}{n^2}\left[\sum_{i=1}^{N} E(a_i)\,z_i^2 + \sum_{i\neq j=1}^{N} E(a_ia_j)\,z_iz_j\right] - \bar{y}_N^{\,2} \tag{164}$$


The expression appears to have been first given by Narain (1951) 
and Horvitz and Thompson (1952). 


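Substituting for $z_i$ from (162) in (164), we have

$$V(\bar{z}_n) = \frac{1}{N^2}\left[\sum_{i=1}^{N}\frac{y_i^2}{E(a_i)} + \sum_{i\neq j=1}^{N}\frac{E(a_ia_j)}{E(a_i)\,E(a_j)}\,y_iy_j\right] - \bar{y}_N^{\,2} \tag{165}$$

which for n = 2 reduces to (156).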
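
As a numerical check of (165), the variance of $\bar{z}_n$ can be computed by direct enumeration of a small design and compared with the formula, and also with the difference form due to Yates and Grundy given in the footnote to (166). This is an illustrative sketch only: the values `y` and `P` are invented, and the design assumed is the sample of two of the previous section, with the second draw proportional to the remaining $P$'s.

```python
from fractions import Fraction as F
from itertools import permutations

# Illustrative population and first-draw probabilities (assumed values).
y = [F(3), F(7), F(12), F(20)]
P = [F(1, 10), F(1, 5), F(3, 10), F(2, 5)]
N, n = len(y), 2

# Probability of each unordered sample {i, j} under successive draws.
design = {}
for i, j in permutations(range(N), 2):
    s = frozenset((i, j))
    design[s] = design.get(s, 0) + P[i] * P[j] / (1 - P[i])

# E(a_i) and E(a_i a_j) obtained by enumeration.
pi1 = [sum(pr for s, pr in design.items() if i in s) for i in range(N)]
pi2 = {(i, j): design[frozenset((i, j))] for i, j in permutations(range(N), 2)}

ybar_N = sum(y) / N

def estimate(s):
    """Mean of z_i = n*y_i/(N*E(a_i)) over the sample, as in (162)."""
    return sum(y[i] / pi1[i] for i in s) / N

# Variance by enumeration (the estimate is unbiased, so this is E(est - mean)^2).
var_enum = sum(pr * (estimate(s) - ybar_N) ** 2 for s, pr in design.items())

# Variance from (165).
var_165 = (sum(y[i] ** 2 / pi1[i] for i in range(N))
           + sum(pi2[i, j] / (pi1[i] * pi1[j]) * y[i] * y[j] for (i, j) in pi2)
           ) / N ** 2 - ybar_N ** 2

# Yates-Grundy form in terms of squared differences of the z's.
z = [n * y[i] / (N * pi1[i]) for i in range(N)]
var_yg = sum((pi1[i] * pi1[j] - pi2[i, j]) * (z[i] - z[j]) ** 2
             for (i, j) in pi2) / (2 * n ** 2)

print(var_enum == var_165 == var_yg)
```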


An estimate of the variance $V(\bar{z}_n)$ is easily derived. We
write

$$\text{Est. } V(\bar{z}_n) = \bar{z}_n^{\,2} - \text{Est. } \bar{y}_N^{\,2} \tag{166*}$$

The values of $E(a_i)$ and $E(a_ia_j)$ depend upon the choice of
values for $P_{ir}$ $(r = 1, 2, \ldots, n)$. Explicit expressions have been
given for n = 2 in the last section and expressions could be written
in a similar way for any sample size. It will be noticed that if
$E(a_i)$ is proportional to $y_i$ the estimate $\bar{z}_n$ reduces to a constant
with zero variance. Values of $y_i$ will, however, not be known
but values $x_i$ of a character correlated with $y_i$ may be known.
It follows as a near approximation that the optimum values of
$P_{ir}$ would require that $E(a_i)$ should be proportional to $x_i$. The
explicit solution of this problem is, however, difficult and will not
be discussed here. One result may be mentioned, namely, that
it is possible to choose $E(a_i)$ proportional to $x_i$ only if the
latter are not too heterogeneous (Narain, 1951).


* Yates and Grundy (1953) have developed an alternative estimate of the
variance $V(\bar{z}_n)$, which appears to be better than the one given in (166). They
use the fact that

$$\sum_{j\neq i}^{N}\left\{E(a_ia_j) - E(a_i)E(a_j)\right\} = -E(a_i)\left\{1 - E(a_i)\right\}$$

to recast the expression for the variance $V(\bar{z}_n)$ as a linear function of the
squares of the differences of the $z$'s. Thus, (165) can be written as

$$V(\bar{z}_n) = \frac{1}{N^2}\left[\sum_{i=1}^{N}\frac{1 - E(a_i)}{E(a_i)}\,y_i^2 + \sum_{i\neq j=1}^{N}\frac{E(a_ia_j) - E(a_i)E(a_j)}{E(a_i)\,E(a_j)}\,y_iy_j\right]$$

and using the above result, substituting for $y_i$ in terms of $z_i$ from (162),
they obtain

$$V(\bar{z}_n) = \frac{1}{2n^2}\sum_{i\neq j=1}^{N}\left\{E(a_i)E(a_j) - E(a_ia_j)\right\}(z_i - z_j)^2$$

It is easy to see that an unbiased estimate of $V(\bar{z}_n)$ is given by

$$\text{Est. } V(\bar{z}_n) = \frac{1}{2n^2}\sum_{i\neq j=1}^{n}\frac{E(a_i)E(a_j) - E(a_ia_j)}{E(a_ia_j)}\,(z_i - z_j)^2$$

which, unlike (166), is seen to be a linear function of the squares of the
differences between the $z$'s in the sample.




Lastly, we mention one system of selection probabilities due to
Midzuno (1950) which, while by no means efficient in the sense
indicated above, has the merit of simplicity. Here the units at
the first draw are selected with unequal probability, but at the
second and subsequent draws they are selected with equal
probability.
It is easy to see that under this system

$$\begin{aligned}
E(a_i) &= P_{i1} + P_{i2} + \cdots + P_{in} \\
&= P_i + (1 - P_i)\frac{1}{N-1} + \left\{1 - P_i - (1 - P_i)\frac{1}{N-1}\right\}\frac{1}{N-2} + \cdots \\
&= P_i + \frac{n-1}{N-1}\,(1 - P_i) \\
&= \frac{N-n}{N-1}\,P_i + \frac{n-1}{N-1}
\end{aligned} \tag{167}$$

while

$$\begin{aligned}
E(a_ia_j) &= P_i\,E(a_j \mid y_i \text{ drawn first}) + P_j\,E(a_i \mid y_j \text{ drawn first}) + (1 - P_i - P_j)\,E(a_ia_j \mid \text{neither drawn first}) \\
&= (P_i + P_j)\frac{n-1}{N-1} + (1 - P_i - P_j)\frac{(n-1)(n-2)}{(N-1)(N-2)} \\
&= \frac{n-1}{N-1}\left\{\frac{N-n}{N-2}\,(P_i + P_j) + \frac{n-2}{N-2}\right\}
\end{aligned} \tag{168}$$

Incidentally, we note that under this system the probability of
drawing a specified sample is proportional to the total measure of
the units included in the sample.
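
Expressions (167) and (168) lend themselves to a direct check by enumeration. The sketch below is illustrative only: the values `y` and `P` are invented, and it uses the property just noted, that a sample $s$ has probability $\sum_{i \in s} P_i \big/ \binom{N-1}{n-1}$, to list the whole design.

```python
from fractions import Fraction as F
from itertools import combinations
from math import comb

# Illustrative population and first-draw probabilities (assumed values).
y = [F(4), F(9), F(11), F(16), F(25)]
P = [F(2, 20), F(3, 20), F(4, 20), F(5, 20), F(6, 20)]
N, n = len(P), 3

# Probability of each sample under Midzuno's scheme.
design = {s: sum(P[i] for i in s) / comb(N - 1, n - 1)
          for s in combinations(range(N), n)}

# Inclusion probabilities by enumeration.
incl1 = [sum(pr for s, pr in design.items() if i in s) for i in range(N)]
incl2 = {(i, j): sum(pr for s, pr in design.items() if i in s and j in s)
         for i, j in combinations(range(N), 2)}

# The same quantities from (167) and (168).
formula1 = [F(N - n, N - 1) * P[i] + F(n - 1, N - 1) for i in range(N)]
formula2 = {(i, j): F(n - 1, N - 1) * (F(N - n, N - 2) * (P[i] + P[j]) + F(n - 2, N - 2))
            for i, j in combinations(range(N), 2)}

# Expected value of the estimate based on (162).
expected_estimate = sum(pr * sum(y[i] / incl1[i] for i in s) / N
                        for s, pr in design.items())

print(incl1 == formula1, incl2 == formula2, expected_estimate == sum(y) / N)
```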


REFERENCES

1. David, F. N. and Neyman, J. (1938). "Extension of the Markoff Theorem on Least Squares," Statist. Res. Mem., 2, 105-16.

2. Sukhatme, P. V. (1938). "On Bipartitional Functions," Phil. Trans. Roy. Soc., London, Series A, 237, 375-409.

3. Sukhatme, P. V. (1944). "Moments and Product Moments of Moment-Statistics for Samples of the Finite and Infinite Populations," Sankhya, 6, 363-82.

4. Fisher, R. A. and Yates, F. (1938). Statistical Tables for Biological, Agricultural and Medical Research, Oliver and Boyd, Ltd., London.

5. Neyman, J. (1934). "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection," Jour. Roy. Statist. Soc., 97, 558-606.

6. Sukhatme, P. V. (1935). "Contribution to the Theory of the Representative Method," Jour. Roy. Statist. Soc. Suppl., 2, 253-68.

7. Bartlett, M. S. (1937). "Sub-sampling for Attributes," Jour. Roy. Statist. Soc. Suppl., 4, 131-35.

8. Narain, R. D. (1951). "On Sampling without Replacement with Varying Probabilities," Jour. Ind. Soc. Agr. Statist., 3, 169-74.

9. Horvitz, D. G. and Thompson, D. J. (1952). "A Generalisation of Sampling without Replacement from a Finite Universe," Jour. Amer. Statist. Assoc., 47, 663-85.

10. Midzuno, H. (1950). "An Outline of the Theory of Sampling Systems," Ann. Inst. Statist. Math., Japan, 1, 149-56.


APPENDIX 
Tables of sg (P, Q)/x1! хо!... 
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а) (2) (1°) (3) Q1) (13) 
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P 2) 1 1 
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Q 
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(5) 1 5 10 10 15 10 1 
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est w=2 w=3 
Q Q 
а) о) аз Gy ер u 
P O: 1 Р О) 1 (3) 1 
Р (01) -4 
аз —1 1 (5 2 1 
№ = 4 
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w=6 


0 
(6 (5) (4) 09 (418) (321) (29 (318) (2218) Qr (19 


(6) 1 

(51) — 1 1 

(42) — 1 4 1 

(3°) — 1 А » 1 

(417) 2 —2 =1 а 1 3 2 а Б AT 
P (321) $ ей 1-1 > 1 

(25) 2 em А , " 1 1 

Q1) — 6 6 3 2-3-3 5 1 

(2919). — 6 4 5 2-1-4 -1 Қ 1 


(219 24 —24 -І8 -8 12 20 3 -4 -6 1 
(18) —120 144 90 40 -90 —120 —15 40 45 —15 1 


[Table for w = 7, printed sideways in the original; the entries are not recoverable from this copy.]


Tables of ту! п,!...2; (P, О) 
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Tables of ті! тз! ... gs (P, Q)—(Contd.) 
CHAPTER III
STRATIFIED SAMPLING 
A. SELECTION WITH EQUAL PROBABILITY 


3a.1 Introduction 


We have seen that the precision of a sample estimate of the 
population mean depends upon two factors: (1) the size of the 
sample, and (2) the variability or heterogeneity of the population. 
Apart from the size of the sample, therefore, the only way of 
increasing the precision of an estimate is to devise sampling 
procedures which will effectively reduce the heterogeneity. One 
such procedure is known as the procedure of stratified sampling. 
It consists in dividing the population into k classes and drawing 
random samples of known sizes, one each from the different 
classes. The classes into which the population is divided are called 
the strata and the process is termed the procedure of stratified 
sampling as distinct from the procedure considered in the previous 
chapters, called unrestricted or unstratified sampling. An example 
of stratified sampling is furnished by the survey for estimating 
the average yield of a crop per acre in which administrative areas 
are taken as the strata and random samples of predetermined 
numbers of fields are selected from each of the several strata. The 
geographical proximity of fields within a stratum makes it more 
homogeneous than the entire population and thus helps to increase 
the precision of the estimate. In this chapter we shall consider 
the theory applicable to the procedure of stratified sampling. 


Stratified sampling is a common procedure in sample surveys. 
The procedure ensures any desired representation in the sample 
of all the strata in the population. In unstratified sampling, on 
the other hand, adequate representation of all the strata cannot 
always be ensured and indeed a sample may be so distributed among 
the different strata that certain strata may be over-represented and 
others under-represented. The procedure of stratified sampling 
is thus intended to give a better cross-section of the population 
than that of unstratified sampling. It follows that one would
expect the precision of the estimated character to be higher in
stratified than in unstratified sampling. Stratified sampling also
serves other purposes. The selection of sampling units, the
location and enumeration of the selected units and the distribution
and supervision of field work are all simplified in stratified
sampling. Of course, stratified sampling presupposes the
knowledge of the strata sizes, i.e., the total number of sampling units
in each stratum and the availability of the frame for the selection
of the sample from each stratum.


It is not necessary that the strata be formed of geographically 
contiguous administrative areas. Thus, in yield surveys, the fields 
may be stratified according as they are irrigated or unirrigated and 
separate samples selected from each. In a survey for estimating 
the acreage under crops, strata may be formed by classifying the 
villages according to their geographical area instead of on the basis 
of geographical proximity. The principles to be followed in strati- 
fying a population will become clear in the subsequent sections. 


3a.2 Estimate of the Population Mean and its Variance 
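Let $N_i$ denote the size, i.e., the number of units in the i-th
stratum and $n_i$ the size of the sample to be selected therefrom, so
that

$$\sum_{i=1}^{k} N_i = N, \qquad \sum_{i=1}^{k} n_i = n$$

Now the population mean to be estimated is given by

$$\bar{y}_N = \frac{1}{N}\sum_{i=1}^{k} N_i\,\bar{y}_{N_i} = \sum_{i=1}^{k} p_i\,\bar{y}_{N_i} \tag{1}$$

where $p_i = N_i/N$.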


Since $n_i$ is a simple random sample from $N_i$, it is natural to take

$$\bar{y}_w = \sum_{i=1}^{k} p_i\,\bar{y}_{n_i} \tag{2}$$

as an estimate of the population mean and denote it by $\bar{y}_w$ as it
is the weighted mean with strata sizes as the weights. It is easy
to see that this gives an unbiased estimate of the population
mean, since

$$E(\bar{y}_{n_i}) = \bar{y}_{N_i}$$

and, therefore,

$$E(\bar{y}_w) = E\left(\sum_{i=1}^{k} p_i\,\bar{y}_{n_i}\right) = \sum_{i=1}^{k} p_i\,E(\bar{y}_{n_i}) = \sum_{i=1}^{k} p_i\,\bar{y}_{N_i} = \bar{y}_N \tag{3}$$
To obtain the sampling variance of $\bar{y}_w$, we have

$$\begin{aligned}
V(\bar{y}_w) &= E(\bar{y}_w - \bar{y}_N)^2 \\
&= E\left(\sum_{i=1}^{k} p_i\,\bar{y}_{n_i} - \sum_{i=1}^{k} p_i\,\bar{y}_{N_i}\right)^2 \\
&= E\left\{\sum_{i=1}^{k} p_i\,(\bar{y}_{n_i} - \bar{y}_{N_i})\right\}^2 \\
&= E\left\{\sum_{i=1}^{k} p_i^2\,(\bar{y}_{n_i} - \bar{y}_{N_i})^2 + \sum_{i\neq j=1}^{k} p_ip_j\,(\bar{y}_{n_i} - \bar{y}_{N_i})(\bar{y}_{n_j} - \bar{y}_{N_j})\right\} \\
&= \sum_{i=1}^{k} p_i^2\,E(\bar{y}_{n_i} - \bar{y}_{N_i})^2 + \sum_{i\neq j=1}^{k} p_ip_j\,E\left\{(\bar{y}_{n_i} - \bar{y}_{N_i})(\bar{y}_{n_j} - \bar{y}_{N_j})\right\}
\end{aligned} \tag{4}$$

Since the sample in the i-th stratum is a simple random sample,

$$E(\bar{y}_{n_i} - \bar{y}_{N_i})^2 = \left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_i^2 \tag{5}$$

where $S_i^2$ is the mean square of the population in the i-th stratum
defined by

$$S_i^2 = \frac{1}{N_i - 1}\sum_{j=1}^{N_i}(y_{ij} - \bar{y}_{N_i})^2 \tag{6}$$

The value of the second term in (4) is clearly zero, since samples
are selected independently from each stratum. We therefore have

$$V(\bar{y}_w)_S = \sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 \tag{7}$$

The subscript 'S' in (7) helps to indicate that this variance relates
to stratified sampling and will be used whenever necessary. It will
be seen that the variance depends on $S_i$'s, the variabilities within
the strata. The result suggests that the smaller the $S_i$'s, i.e., the
more homogeneous the strata are made, the greater will be the
precision of the stratified sample.


It can be shown by a slight modification of the Markoff
theorem mentioned in Section 2a.3 that $\bar{y}_w$ is the best unbiased
linear estimate of $\bar{y}_N$. The proof is left to the reader.
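
The results (3) and (7) can be verified exactly on a toy population by listing every possible stratified sample. The following sketch is illustrative only; the two strata and the sample sizes are invented, and the helper names are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations, product

# Illustrative population: two strata with assumed unit values.
strata = [[F(2), F(4), F(9)], [F(10), F(14), F(15), F(21)]]
n_alloc = [2, 2]  # sample size taken from each stratum

def mean(values):
    return sum(values) / len(values)

def mean_square(values):
    """Mean square with divisor (size - 1), as in (6)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

N = sum(len(s) for s in strata)
p = [F(len(s), N) for s in strata]

def weighted_mean(samples):
    """The estimate (2): strata sample means weighted by p_i."""
    return sum(pi * mean(s) for pi, s in zip(p, samples))

def variance_formula():
    """The variance (7)."""
    return sum((F(1, ni) - F(1, len(s))) * pi ** 2 * mean_square(s)
               for pi, s, ni in zip(p, strata, n_alloc))

# All equally likely stratified samples (independent draws per stratum).
all_samples = list(product(*(combinations(s, ni) for s, ni in zip(strata, n_alloc))))
estimates = [weighted_mean(smp) for smp in all_samples]

expected = mean(estimates)
variance_enum = sum((e - expected) ** 2 for e in estimates) / len(estimates)
population_mean = mean([v for s in strata for v in s])

print(expected == population_mean, variance_enum == variance_formula())
```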


3a.3 Choice of Sample Sizes in Different Strata 


The above expression for the variance of the estimated mean 
in stratified sampling shows that the precision of a stratified sample 
depends upon $n_i$'s which can be fixed at will. Following the
principle explained in Section 1.3, we can now so choose the
$n_i$'s as to provide an estimate of the desired precision for a
minimum cost. Alternatively, for any given cost, we can choose the
$n_i$'s so as to minimize the variance of the estimate. The
allocation of the sample to the different strata made in accordance with
either of these principles is said to be based on the principle 
of optimum allocation. The concept, as it is known to-day, was 
introduced by Neyman (1934). 


Just as the variance is a function of the sample sizes, so also 
is the cost of a survey. The manner in which the cost will vary 
with the size of the total sample and with its allocation among 
the different strata will depend upon the particular survey. In 
the simplest case, such as in sampling from punched cards for 
tabulating the results of a census on a sampling basis, the cost 
will be directly proportional to the number of units in the sample. 
In yield surveys in India, where the field work is carried out by the 
local staff in the course of their normal duties and the major item 
in the cost of a survey consists of labour charges on harvesting 
of produce, the cost of the survey is found to be approximately 
proportional to the number of crop-cutting experiments. Cost 
per experiment may, however, vary in the different strata depend- 
ing upon the availability of labour. In this situation, the total 
cost may be appropriately represented by 


$$C = \sum_{i=1}^{k} c_i n_i \tag{8}$$

where $c_i$ is the cost per experiment in the i-th stratum. When
$c_i$ is the same from stratum to stratum, say $c$, the total cost of
a survey is given by

$$C = cn \tag{9}$$
Where travel, field staff salary and statistical analysis are to be 
paid for, the cost function will obviously change in form. We 
shall later on give examples of more complex cost functions 
appropriate to different problems. 


To determine the optimum values of $n_i$ when the cost function
is represented by (8), we consider the function

$$\phi = V(\bar{y}_w) + \mu C$$

where $\mu$ is some constant, and note that

$$\begin{aligned}
V(\bar{y}_w) + \mu C &= \sum_{i=1}^{k}\left\{\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 + \mu c_i n_i\right\} \\
&= \sum_{i=1}^{k}\left(\frac{p_i^2 S_i^2}{n_i} + \mu c_i n_i - 2p_iS_i\sqrt{\mu c_i}\right) + \sum_{i=1}^{k}\left(2p_iS_i\sqrt{\mu c_i} - \frac{p_i^2 S_i^2}{N_i}\right) \\
&= \sum_{i=1}^{k}\left(\frac{p_iS_i}{\sqrt{n_i}} - \sqrt{\mu c_i n_i}\right)^2 + \text{terms independent of } n_i
\end{aligned} \tag{10}$$

Clearly, $V(\bar{y}_w)$ is minimum for fixed $C$, or the cost $C$ of a survey
is minimum for a fixed value of $V(\bar{y}_w)$, when each of the square
terms on the right-hand side of (10) is zero; or in other words,
when

$$n_i = \frac{p_iS_i}{\sqrt{\mu c_i}} \tag{11}$$

Equation (11) shows, what is also otherwise obvious, that:


(i) the larger the size of the stratum, the larger should be the 
size of the sample to be selected therefrom; 


(ii) the larger the variability within a stratum, the larger should 
be the size of the sample from that stratum; and 


(iii) the cheaper the labour in a stratum, the larger the sample 
from that stratum. 


To obtain the exact value of $n_i$'s, we evaluate $1/\sqrt{\mu}$, the constant
of proportionality, so as to satisfy the condition of fixed cost or
fixed variance. In the former case, we substitute from (11) in
(8) and obtain

$$C_0 = \sum_{i=1}^{k}\frac{p_iS_i}{\sqrt{\mu c_i}}\,c_i$$

where $C_0$ is the budgeted amount within which it is desired to
estimate the mean with the maximum precision, giving

$$\frac{1}{\sqrt{\mu}} = \frac{C_0}{\sum_{j=1}^{k} p_jS_j\sqrt{c_j}} \tag{12}$$

Hence

$$n_i = \frac{C_0}{\sqrt{c_i}\,\sum_{j=1}^{k} p_jS_j\sqrt{c_j}}\;p_iS_i \tag{13}$$
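
The allocation (13) is straightforward to compute. The sketch below is a hedged illustration, not a prescription: the strata sizes, standard deviations, unit costs and budget are invented, the function names are hypothetical, and the resulting $n_i$ are left unrounded, whereas in practice they would have to be rounded to whole numbers.

```python
import math

def optimum_allocation(p, S, c, budget):
    """n_i from (13): proportional to p_i*S_i/sqrt(c_i), scaled to the budget."""
    scale = budget / sum(pi * Si * math.sqrt(ci) for pi, Si, ci in zip(p, S, c))
    return [scale * pi * Si / math.sqrt(ci) for pi, Si, ci in zip(p, S, c)]

def stratified_variance(n_alloc, p, S, sizes):
    """Variance (7) of the weighted mean for a given allocation."""
    return sum((1 / ni - 1 / Ni) * pi ** 2 * Si ** 2
               for ni, pi, Si, Ni in zip(n_alloc, p, S, sizes))

# Illustrative inputs (assumed values).
sizes = [500, 300, 200]
p = [Ni / sum(sizes) for Ni in sizes]
S = [4.0, 10.0, 20.0]
c = [1.0, 4.0, 9.0]
budget = 300.0

n_opt = optimum_allocation(p, S, c, budget)

# A competing allocation of the same total cost, proportional to p_i.
unit_cost = sum(ci * pi for ci, pi in zip(c, p))
n_prop = [budget * pi / unit_cost for pi in p]

print(n_opt)
print(stratified_variance(n_opt, p, S, sizes) < stratified_variance(n_prop, p, S, sizes))
```

With these inputs both allocations exhaust the same budget, so the comparison of variances is on equal cost.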


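When $c_i = c$ $(i = 1, 2, \ldots, k)$, and consequently the cost of the
survey is proportional to the size of the sample, $n_i$'s are given by

$$n_i = n\,\frac{p_iS_i}{\sum_{j=1}^{k} p_jS_j} \tag{14}$$

In other words, for a sample of fixed size, allocation in
accordance with (14) gives the maximum precision. This formula
was first given by Tschuprow (1923) and
rediscovered independently by Neyman (1934). The allocation of
the sample in accordance with this formula is, however, known
as Neyman allocation in literature.


When the population mean is desired to be estimated with a
given variance, say $V_0$, at a minimum cost, we evaluate the
constant of proportionality by substituting for $n_i$ from (11) in

$$\sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 = V_0 \tag{15}$$

and obtain, since $p_i = N_i/N$,

$$\frac{1}{\sqrt{\mu}} = \frac{\sum_{j=1}^{k} p_jS_j\sqrt{c_j}}{V_0 + \dfrac{1}{N}\sum_{j=1}^{k} p_jS_j^2} \tag{16}$$

Hence

$$n_i = \frac{p_iS_i}{\sqrt{c_i}}\cdot\frac{\sum_{j=1}^{k} p_jS_j\sqrt{c_j}}{V_0 + \dfrac{1}{N}\sum_{j=1}^{k} p_jS_j^2} \tag{17}$$

When $c_i = c$, (17) reduces to

$$n_i = p_iS_i\cdot\frac{\sum_{j=1}^{k} p_jS_j}{V_0 + \dfrac{1}{N}\sum_{j=1}^{k} p_jS_j^2} \tag{18}$$

so that the minimum sample required for estimating the mean with
fixed variance $V_0$ is given by

$$n = \frac{\left(\sum_{i=1}^{k} p_iS_i\right)^2}{V_0 + \dfrac{1}{N}\sum_{i=1}^{k} p_iS_i^2} \tag{19}$$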

3a.4 Size of Sample for Estimating the Mean with a Given 
Variance under (a) Optimum (Neyman), and (b) Pro- 
portional Allocations 
We have seen that for $n_i$'s arbitrary, the variance of the mean
is given by

$$V(\bar{y}_w)_S = \sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 \tag{20}$$

Under Neyman allocation we have

$$n_i = \frac{n\,p_iS_i}{\sum_{j=1}^{k} p_jS_j} \tag{21}$$

Substituting for $n_i$ from (21) in (20), we get

$$V(\bar{y}_w)_N = \sum_{i=1}^{k}\left(\frac{\sum_{j=1}^{k} p_jS_j}{n\,p_iS_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 = \frac{1}{n}\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \frac{1}{N}\sum_{i=1}^{k} p_iS_i^2 \tag{22}$$

where the subscript N symbolises the variance of the mean under
Neyman allocation.


For proportional allocation of the sample among strata

$$n_i = n\,p_i \tag{23}$$

Substituting in (20), we get

$$V(\bar{y}_w)_P = \sum_{i=1}^{k}\left(\frac{1}{n\,p_i} - \frac{1}{N p_i}\right)p_i^2 S_i^2 = \frac{N-n}{Nn}\sum_{i=1}^{k} p_iS_i^2 \tag{24}$$

where the subscript P symbolises the variance under the
proportional system of allocation.


It follows from (22) that for estimating the mean from a
stratified sample under Neyman allocation with a given variance
$V_0$, the size of sample required is given by

$$n = \frac{\left(\sum_{i=1}^{k} p_iS_i\right)^2}{V_0 + \dfrac{1}{N}\sum_{i=1}^{k} p_iS_i^2} \tag{25}$$

and that under proportional allocation, $n$ is obtained from (24),
being given by

$$n = \frac{\sum_{i=1}^{k} p_iS_i^2}{V_0 + \dfrac{1}{N}\sum_{i=1}^{k} p_iS_i^2} \tag{26}$$

Equation (25) is seen to be identical with (19), as is to be expected.
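
The sample sizes (25) and (26) can be compared numerically. The following is a small illustrative sketch with invented strata weights, standard deviations, population size and target variance; the function names are hypothetical and the sizes are left unrounded.

```python
def n_neyman(p, S, N, V0):
    """Sample size (25) under Neyman allocation for target variance V0."""
    return sum(pi * Si for pi, Si in zip(p, S)) ** 2 / (
        V0 + sum(pi * Si ** 2 for pi, Si in zip(p, S)) / N)

def n_proportional(p, S, N, V0):
    """Sample size (26) under proportional allocation for target variance V0."""
    within = sum(pi * Si ** 2 for pi, Si in zip(p, S))
    return within / (V0 + within / N)

def var_neyman(p, S, N, n):
    """Variance (22)."""
    return sum(pi * Si for pi, Si in zip(p, S)) ** 2 / n - sum(
        pi * Si ** 2 for pi, Si in zip(p, S)) / N

def var_proportional(p, S, N, n):
    """Variance (24)."""
    return (1 / n - 1 / N) * sum(pi * Si ** 2 for pi, Si in zip(p, S))

# Illustrative inputs (assumed values).
p, S, N, V0 = [0.5, 0.3, 0.2], [4.0, 10.0, 20.0], 1000, 0.25

size_neyman = n_neyman(p, S, N, V0)
size_prop = n_proportional(p, S, N, V0)
print(round(size_neyman, 1), round(size_prop, 1))
```

Substituting either size back into the corresponding variance formula returns the target variance, which is a convenient check on the algebra.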


3a.5 Comparison of Stratified with Unstratified Simple Random 
Sampling 
We shall discuss the efficiency of stratified sampling first under 
Proportional allocation, then under Neyman allocation and finally 
under an arbitrary allocation. 
(a) Proportional Allocation 
We have seen that

$$V(\bar{y}_w)_P = \left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{k} p_iS_i^2 \tag{27}$$

The variance of the mean under unstratified simple random
sampling may be written as

$$V(\bar{y}_n)_R = \left(\frac{1}{n} - \frac{1}{N}\right)S^2 \tag{28}$$

For purposes of comparing (27) with (28), it is necessary to
express $S^2$ in terms of $S_i^2$.

Now the total sum of squares in the population can be split up
into two parts, viz., (a) within strata, and (b) between strata, in
accordance with the identity:

$$\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{N_i}(y_{ij} - \bar{y}_N)^2 &= \sum_{i=1}^{k}\sum_{j=1}^{N_i}(y_{ij} - \bar{y}_{N_i} + \bar{y}_{N_i} - \bar{y}_N)^2 \\
&= \sum_{i=1}^{k}\sum_{j=1}^{N_i}(y_{ij} - \bar{y}_{N_i})^2 + \sum_{i=1}^{k} N_i\,(\bar{y}_{N_i} - \bar{y}_N)^2
\end{aligned}$$

This can be written as

$$(N-1)\,S^2 = \sum_{i=1}^{k}(N_i - 1)\,S_i^2 + \sum_{i=1}^{k} N_i\,(\bar{y}_{N_i} - \bar{y}_N)^2 \tag{29}$$

For purposes of simplification we will assume $N_i$ to be large enough
to permit the approximations

$$\frac{N_i - 1}{N_i} \simeq 1 \quad\text{and}\quad \frac{N-1}{N} \simeq 1$$

On dividing by $N$, and making use of these approximations, we
get

$$S^2 \simeq \sum_{i=1}^{k} p_iS_i^2 + \sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2$$

or

$$\sum_{i=1}^{k} p_iS_i^2 \simeq S^2 - \sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2 \tag{30}$$

Substituting the result in (27), we have

$$V(\bar{y}_w)_P = \left(\frac{1}{n} - \frac{1}{N}\right)\left\{S^2 - \sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2\right\} \tag{31}$$

Hence, subtracting (31) from (28), we obtain

$$V(\bar{y}_n)_R - V(\bar{y}_w)_P = \frac{N-n}{Nn}\sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2 \tag{32}$$

The expression shows that the more the strata differ in their
means, the larger is the gain in precision due to proportional
sampling over unstratified simple random sampling.


(b) Neyman Allocation


Here, again, we shall first express the variance of the mean
under Neyman allocation in a form suitable for comparison with
the variance in unstratified simple random sampling. We shall
make use of the identity:

$$\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2 = \sum_{i=1}^{k} p_iS_i^2 - \left(\sum_{i=1}^{k} p_iS_i\right)^2 \tag{33}$$

where

$$\bar{S}_w = \sum_{i=1}^{k} p_iS_i$$

On substituting for $\left(\sum_{i=1}^{k} p_iS_i\right)^2$ from (33) in the expression for the
variance of the mean under Neyman allocation given in (22), we
obtain

$$\begin{aligned}
V(\bar{y}_w)_N &= \frac{1}{n}\left\{\sum_{i=1}^{k} p_iS_i^2 - \sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2\right\} - \frac{1}{N}\sum_{i=1}^{k} p_iS_i^2 \\
&= \frac{N-n}{Nn}\sum_{i=1}^{k} p_iS_i^2 - \frac{1}{n}\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2
\end{aligned} \tag{34}$$

The first term on the right-hand side of the above equation
represents the variance under proportional allocation. Hence

$$V(\bar{y}_w)_P - V(\bar{y}_w)_N = \frac{1}{n}\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2 \tag{35}$$

We see that the larger the differences between the strata standard
deviations, the larger is the gain in precision of optimum over
proportional allocation. Further, on substituting for $V(\bar{y}_w)_P$ from
(31), we obtain

$$V(\bar{y}_w)_N = \frac{N-n}{Nn}\left\{S^2 - \sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2\right\} - \frac{1}{n}\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2 \tag{36}$$

Since the first term on the right-hand side of (36) represents the
variance of the mean of an unstratified sample of $n$, we may write

$$V(\bar{y}_n)_R - V(\bar{y}_w)_N = \frac{N-n}{Nn}\sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2 + \frac{1}{n}\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2 \tag{37}$$

The equation (37) shows that the gain in precision of Neyman
allocation over unstratified simple random sampling arises from
two factors, viz., (a) differences between strata means, and
(b) differences between strata standard deviations.

The above result is exceedingly helpful in devising a method
of stratification. It suggests that for efficient stratification, the
population should be so divided that the differences between the
strata means and standard deviations are as large as possible.
This is best done in practice by grouping together like units of
the population. Thus, geographically contiguous units are usually
more alike than those further apart. Consequently, the adoption
of geographical proximity as the basis for stratification is expected
to lead to a gain in precision apart from being convenient for
purposes of organization of field work. On the other hand,
experience shows that the gains made from geographical stratifi- 
cation are generally moderate. Another method of setting up 
strata is to use the information on some correlated character as 
the basis for stratification. For example, the size of farm is 
known to be correlated with a number of farm characters like the 
area under principal crops or the number of livestock; stratifica- 
tion by size of farm in agricultural surveys is, therefore, expected 
to lead to substantial gains in precision. Example 3.1 at the end 
of Section 3a.10 will serve to illustrate the magnitude of gains 
recorded from this type of stratification. 
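
The relations (32), (35) and (37) can be illustrated with a few lines of arithmetic. The sketch below uses invented strata weights, means and standard deviations, and takes $S^2$ from the large-stratum approximation (30); all names are hypothetical.

```python
from fractions import Fraction as F

# Illustrative strata (assumed values): weights, means, standard deviations.
p = [F(1, 2), F(3, 10), F(1, 5)]
means = [F(10), F(30), F(60)]
S = [F(4), F(10), F(20)]
N, n = 1000, 100

pop_mean = sum(pi * m for pi, m in zip(p, means))
within = sum(pi * Si ** 2 for pi, Si in zip(p, S))                   # sum p_i S_i^2
between = sum(pi * (m - pop_mean) ** 2 for pi, m in zip(p, means))  # sum p_i (mean_i - mean)^2
S_bar = sum(pi * Si for pi, Si in zip(p, S))                         # weighted mean of S_i
spread = sum(pi * (Si - S_bar) ** 2 for pi, Si in zip(p, S))         # sum p_i (S_i - S_bar)^2

S2 = within + between                    # approximation (30)
fpc = F(1, n) - F(1, N)

V_R = fpc * S2                           # unstratified, (28)
V_P = fpc * within                       # proportional, (27)
V_N = S_bar ** 2 / n - within / N        # Neyman, (22)

print(float(V_R), float(V_P), float(V_N))
```

For these figures the three variances are 4.338, 1.062 and 0.692 respectively, so most of the gain comes from the differences between strata means and a further part from the differences between strata standard deviations.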


(c) Arbitrary Allocation 


When a sample is divided arbitrarily among the strata, the
variance of the estimated mean is given by the expression (7);
whereas, when it is selected as an unstratified simple random
sample, the variance is given by

$$V(\bar{y}_n)_R = \left(\frac{1}{n} - \frac{1}{N}\right)S^2$$

We therefore have

$$V(\bar{y}_n)_R - V(\bar{y}_w)_S = \left(\frac{1}{n} - \frac{1}{N}\right)S^2 - \sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 \tag{38}$$

Substituting for $S^2$ from (30), we write

$$\begin{aligned}
V(\bar{y}_n)_R - V(\bar{y}_w)_S &= \left(\frac{1}{n} - \frac{1}{N}\right)\left\{\sum_{i=1}^{k} p_iS_i^2 + \sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2\right\} - \sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2 \\
&= \sum_{i=1}^{k}\left(\frac{1}{n} - \frac{p_i}{n_i}\right)p_iS_i^2 + \left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{k} p_i\,(\bar{y}_{N_i} - \bar{y}_N)^2
\end{aligned} \tag{39}$$

The second term on the right-hand side in (39) is always positive,
but the first may be positive, zero or negative, depending upon the
values of $n_i$. It is positive when the sample is allocated according
to Neyman allocation and we reach the result (37). It is zero when
either

$$n_i = n\,p_i \tag{40}$$

or

$$n_i = n\,\frac{p_iS_i^2}{\sum_{j=1}^{k} p_jS_j^2} \tag{41}$$

giving us (32). As the allocation departs from (40) or (41), the first
term may not only become negative but be larger in magnitude
than the second, thus making a stratified sample less efficient than
an unstratified sample. The result is important and suggests the
need for care in the allocation of the sample among the strata.
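
That a badly allocated stratified sample can be less precise than an unstratified one is easily seen on figures. The sketch below is illustrative only, with invented strata and a deliberately poor allocation that puts most of the sample in the least variable stratum; $S^2$ is taken from the approximation (30).

```python
from fractions import Fraction as F

# Illustrative strata (assumed values).
sizes = [500, 300, 200]
N = sum(sizes)
p = [F(Ni, N) for Ni in sizes]
means = [F(10), F(30), F(60)]
S = [F(4), F(10), F(20)]
n = 100

pop_mean = sum(pi * m for pi, m in zip(p, means))
between = sum(pi * (m - pop_mean) ** 2 for pi, m in zip(p, means))
S2 = sum(pi * Si ** 2 for pi, Si in zip(p, S)) + between   # approximation (30)
V_R = (F(1, n) - F(1, N)) * S2                             # unstratified variance

def gain(n_alloc):
    """V_R minus the stratified variance (7) for the given allocation."""
    V_S = sum((F(1, ni) - F(1, Ni)) * pi ** 2 * Si ** 2
              for ni, Ni, pi, Si in zip(n_alloc, sizes, p, S))
    return V_R - V_S

def gain_from_39(n_alloc):
    """The same difference written as the two terms of (39)."""
    first = sum((F(1, n) - pi / ni) * pi * Si ** 2 for ni, pi, Si in zip(n_alloc, p, S))
    second = (F(1, n) - F(1, N)) * between
    return first + second

poor = [90, 5, 5]            # most of the sample in the least variable stratum
proportional = [50, 30, 20]

print(float(gain(poor)), float(gain(proportional)))
```

Here the poor allocation gives a negative gain, while the proportional allocation gives the positive gain stated in (32).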


3a.6* Practical Difficulties in Adopting the Neyman Method 
of Allocation 


There are certain limitations to the use of the Neyman
allocation in practice which will now be pointed out. If more than one
character is to be estimated from a sample survey, then the
allocation of the sample into different strata on the basis of any
one character, using the Neyman method, may lead to loss in
precision on other characters as compared to the method of
proportional allocation. If, however, the characters are correlated, or
if certain characters are more important than others, then gains
in precision on the estimates of the more important characters
can still be secured by using the Neyman method of allocation.
However, the more severe limitation on the use of the Neyman
allocation is the absence of the knowledge of $S_i$'s. One method
of overcoming this limitation is to estimate $S_i$'s from a preliminary
sample of $n'$ (Sukhatme, 1935). These estimates will, however,
be subject to standard errors and it is, therefore, possible that we
would be worse off than if we had used the method of proportional
allocation. The problems to be considered are then, (a) how much
would the variance increase, on the average, if the allocation is
based on estimated values of $S_i$, (b) how does it compare with
the variance from proportional allocation, and (c) how large
should be the size of the preliminary sample in order that Neyman
allocation may give a more precise estimate than proportional
allocation?


Let $s_i$ represent an unbiased estimate of $S_i/\nu$, based on a sample
of size $n'$, $\nu$ denoting a constant, so that

$$E(\nu s_i) = S_i \qquad (i = 1, 2, \ldots, k) \tag{42}$$

The allocation of the total sample among the different strata will
now be made in accordance with the formula*

$$n_i = n\,\frac{p_is_i}{\sum_{j=1}^{k} p_js_j} \tag{43}$$

Substituting in (7), we obtain

$$\begin{aligned}
V(\bar{y}_w)_S &= \sum_{i=1}^{k}\left(\frac{\sum_{j=1}^{k} p_js_j}{n\,p_is_i} - \frac{1}{N p_i}\right)p_i^2 S_i^2 \\
&= \frac{1}{n}\left[\sum_{i=1}^{k} p_i^2 S_i^2 + \sum_{i\neq j=1}^{k} p_ip_j\,S_i^2\,\frac{s_j}{s_i}\right] - \frac{1}{N}\sum_{i=1}^{k} p_iS_i^2
\end{aligned} \tag{44}$$

This expression involves $s_i$ and we are consequently not in a
position to say whether this will give a smaller value or not as
compared to that for proportional allocation, viz.,

$$V(\bar{y}_w)_P = \left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{k} p_iS_i^2 \tag{45}$$

* Where, as in this case, the decision regarding the size of the additional
sample to be drawn from each stratum depends upon the results of the first sample,
the procedure is essentially what is called sequential sampling.

We might, however, obtain the average value of (44) in samples
of $n'$ and examine how it compares with (45).


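Now it can be shown (Sukhatme, 1935) that if the variate under
study can be considered as normally distributed and consequently
$s_i^2$ is distributed as $\chi^2$, the average value of the right-hand side in
(44) is given approximately by

$$\frac{1}{n}\left[\sum_{i=1}^{k} p_i^2 S_i^2 + \theta\sum_{i\neq j=1}^{k} p_ip_j\,S_iS_j\right] - \frac{1}{N}\sum_{i=1}^{k} p_iS_i^2 \tag{46}$$

where

$$\theta = 1 + \frac{1}{2n'} \tag{47}$$

Substituting for $\theta$ from (47) in (46), we obtain

$$E\left\{V(\bar{y}_w)_S\right\} = \frac{1}{n}\left(1 + \frac{1}{2n'}\right)\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \frac{1}{2nn'}\sum_{i=1}^{k} p_i^2 S_i^2 - \frac{1}{N}\sum_{i=1}^{k} p_iS_i^2 \tag{48}$$

which, on using (22), can also be written as

$$E\left\{V(\bar{y}_w)_S\right\} = V(\bar{y}_w)_N + \frac{1}{2nn'}\left\{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2\right\} \tag{49}$$

The first part on the right-hand side denotes the variance of the
mean under Neyman allocation when the $S_i$ are known.
Consequently, when the $S_i$ are estimated from a preliminary sample
of size $n'$, this variance is seen to increase, on an average, by

$$\frac{1}{2nn'}\left\{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2\right\} \tag{50}$$

Comparing (48) with the value of the variance under proportional
allocation given in (45), we notice that the condition that the
Neyman allocation may not, on the average, lead to loss of
precision as compared to proportional allocation is

$$\frac{1}{2nn'}\left\{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2\right\} < \frac{1}{n}\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2 \tag{51}$$

or

$$n' > \frac{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2}{2\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2} \tag{52}$$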

The above result can be derived more simply by an alternative 
method given by Evans (1951). 


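Let, as before,

$$\nu s_i = S_i + \epsilon_i$$

where

$$E(\epsilon_i) = 0, \qquad E(\epsilon_i^2) = \nu^2 V(s_i)$$

and

$$\nu s_j = S_j + \epsilon_j$$

where

$$E(\epsilon_j) = 0, \qquad E(\epsilon_j^2) = \nu^2 V(s_j)$$

Then $s_j/s_i$ can be expressed as

$$\frac{s_j}{s_i} = \frac{S_j + \epsilon_j}{S_i + \epsilon_i} = \frac{S_j}{S_i}\left(1 + \frac{\epsilon_j}{S_j}\right)\left(1 + \frac{\epsilon_i}{S_i}\right)^{-1}$$

On expanding the right-hand side and neglecting terms involving
powers of $\epsilon$ higher than the second and taking expectations, we
obtain

$$E\left(\frac{s_j}{s_i}\right) \simeq \frac{S_j}{S_i}\left(1 + C_i^2\right)$$

where $C_i$ is the coefficient of variation of $s_i$ given by $\nu\sqrt{V(s_i)}/S_i$.
Assuming $C_i = C$ $(i = 1, 2, \ldots, k)$, we obtain

$$E\left(\frac{s_j}{s_i}\right) \simeq \frac{S_j}{S_i}\left(1 + C^2\right) \tag{53}$$

Taking expectations of both sides in (44) and substituting from
(53), we have

$$\begin{aligned}
E\left\{V(\bar{y}_w)_S\right\} &\simeq \frac{1}{n}\left[\sum_{i=1}^{k} p_i^2 S_i^2 + (1 + C^2)\sum_{i\neq j=1}^{k} p_ip_j\,S_iS_j\right] - \frac{1}{N}\sum_{i=1}^{k} p_iS_i^2 \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{k} p_iS_i^2 - \frac{1}{n}\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2 + \frac{C^2}{n}\left\{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2\right\}
\end{aligned} \tag{54}$$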


Clearly, this expression will have a smaller value as compared to
that under proportional allocation if the sum of the last two
terms is negative, i.e., if

$$C^2 < \frac{\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2}{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2} \tag{55}$$

From Sections 2a.10 and 2a.11 of the last chapter, we know that,
for samples of $n'$, $C^2$ is approximately given by $(\beta_2 - 1)/4n'$, so
that the size of the preliminary sample should be such that

$$n' > \frac{\beta_2 - 1}{4}\cdot\frac{\left(\sum_{i=1}^{k} p_iS_i\right)^2 - \sum_{i=1}^{k} p_i^2 S_i^2}{\sum_{i=1}^{k} p_i\,(S_i - \bar{S}_w)^2} \tag{56}$$

in order that the Neyman allocation may give, on the average,
a more precise estimate than the method of proportional allocation.


When $\beta_2 = 3$, the value of $n'$ reduces to that given by (52).


It will be seen that the larger the variability among $S_i$'s, the
smaller will be the value of $n'$. Consequently, unless $S_i$'s are close
to one another, even moderately small preliminary samples will 
give, on the average, more precise results than proportional allo- 
cation. If, however, the size of the preliminary sample is found 
to be so large as to make the preliminary inquiry not worth while 
and if the study of several characters is included in the sample, 
proportional allocation would be preferable. 
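
The lower bound on the size of the preliminary sample can be evaluated for any assumed set of strata. The sketch below is a hedged illustration: it takes the bound in the form $n' > \frac{\beta_2 - 1}{4}\left\{(\sum p_iS_i)^2 - \sum p_i^2S_i^2\right\} \big/ \sum p_i(S_i - \bar{S}_w)^2$ with $\bar{S}_w = \sum p_iS_i$, which is this sketch's reading of (56), and the strata values and function name are invented.

```python
from fractions import Fraction as F

def min_preliminary_size(p, S, beta2=3):
    """Lower bound on n' below which proportional allocation is preferable.

    Assumes the bound has the form stated in the lead-in; beta2 = 3
    corresponds to a normally distributed variate. Undefined (division
    by zero) when all S_i are equal.
    """
    S_bar = sum(pi * Si for pi, Si in zip(p, S))
    numerator = S_bar ** 2 - sum(pi ** 2 * Si ** 2 for pi, Si in zip(p, S))
    spread = sum(pi * (Si - S_bar) ** 2 for pi, Si in zip(p, S))
    return (F(beta2) - 1) / 4 * numerator / spread

p = [F(1, 2), F(3, 10), F(1, 5)]

# Widely different strata standard deviations: a very small bound.
bound_heterogeneous = min_preliminary_size(p, [4, 10, 20])

# Nearly equal strata standard deviations: a much larger bound.
bound_homogeneous = min_preliminary_size(p, [9, 10, 11])

print(float(bound_heterogeneous), float(bound_homogeneous))
```

The contrast between the two cases reflects the remark in the text that the more the $S_i$'s differ, the smaller the preliminary sample need be.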


3a.7 Evaluation from the Sample of the Gain in Precision due
to Stratification


In comparing the precision of the stratified with unstratified 
simple random sampling in Section 3a.5, we assumed that the 
population values of the strata means and standard deviations 
were known. Usually, however, this will not be the case. What 
is available is only a stratified sample and the problem is to 
estimate the gain in precision due to stratification. 


Let $n_1, n_2, \ldots, n_k$ represent the stratified sample. Then the
variance of the sample mean is

$$V(\bar{y}_w)_S = \sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 S_i^2$$

and its estimate is clearly given by

$$\text{Est. } V(\bar{y}_w)_S = \sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)p_i^2 s_i^2 \tag{57}$$

where $s_i^2$ is the mean square in the sample drawn from the i-th
stratum. If the total sample had been selected by the procedure
of simple random sampling without stratification, then the variance
of the sample mean would be

$$V(\bar{y}_n)_R = \frac{N-n}{Nn}\,S^2 \tag{58}$$
Its estimate cannot, however, be obtained by substituting the 
mean square for the total sample in place of S?. For, although 
within each stratum the sample is selected by the method of simple 
random sampling, the total sample cannot be considered to have 
been so selected from the population as a whole. The problem, 


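therefore, is to estimate $S^2$, given
$\bar{y}_{n_1}, \bar{y}_{n_2}, \ldots, \bar{y}_{n_k}$ and $s_1^2, s_2^2, \ldots, s_k^2$.

We have, from (29),

\[ S^2 = \frac{1}{N-1}\left[\sum_{i=1}^{k} (N_i - 1) S_i^2 + N \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2\right] \tag{59} \]

Since $s_i^2$ provides an unbiased estimate of $S_i^2$, the problem of
estimating $S^2$ reduces to the estimation of

\[ \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2 \tag{60} \]

from the sample.
Let

\[ \bar{y}_{n_i} = \bar{y}_{N_i} + \epsilon_i \tag{61} \]

where

\[ E(\epsilon_i) = 0, \quad \text{and} \quad E(\epsilon_i^2) = \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2 \]

On squaring both sides of (61), we have

\[ \bar{y}_{n_i}^2 = \bar{y}_{N_i}^2 + \epsilon_i^2 + 2 \bar{y}_{N_i} \epsilon_i \tag{62} \]

Taking expectations, we obtain

\[ E(\bar{y}_{n_i}^2) = \bar{y}_{N_i}^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2 \tag{63} \]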
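\[ \text{Est. } \{V(\bar{y}_n)_{us} - V(\bar{y}_w)\} = \frac{N-n}{Nn} \sum_{i=1}^{k} p_i s_i^2 - \sum_{i=1}^{k} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) p_i^2 s_i^2
 + \frac{N-n}{n(N-1)} \left[\sum_{i=1}^{k} p_i (\bar{y}_{n_i} - \bar{y}_w)^2 - \sum_{i=1}^{k} p_i (1 - p_i) \frac{s_i^2}{n_i}\right] \tag{73} \]

Taking N/(N − 1) = 1, we obtain

\[ \text{Est. } \{V(\bar{y}_n)_{us} - V(\bar{y}_w)\} = \frac{N-n}{Nn} \sum_{i=1}^{k} p_i s_i^2 - \sum_{i=1}^{k} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) p_i^2 s_i^2
 + \frac{N-n}{Nn} \left[\sum_{i=1}^{k} p_i (\bar{y}_{n_i} - \bar{y}_w)^2 - \sum_{i=1}^{k} p_i (1 - p_i) \frac{s_i^2}{n_i}\right] \tag{74} \]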


The ratio of (73) or (74) to (57), expressed as a percentage, gives 
the estimate of the gain in efficiency due to stratification. 


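These results assume a particularly simple form in the case of
proportional sampling for which $p_i = n_i/n$. Equation (57) then
becomes

\[ \text{Est. } V(\bar{y}_w) = \frac{N-n}{Nn} \sum_{i=1}^{k} p_i s_i^2 \tag{75} \]

and the first two terms in (74) cancel. The net reduction in
variance due to stratification is, therefore, given by the last two
terms in (74). On substituting $n_i/n$ for $p_i$ (so that $\bar{y}_w = \bar{y}_n$),
this takes the value

\[ \text{Est. } V(\bar{y}_n)_{us} - \text{Est. } V(\bar{y}_w) = \frac{N-n}{Nn^2}
 \left\{\sum_{i=1}^{k} n_i (\bar{y}_{n_i} - \bar{y}_n)^2 - \sum_{i=1}^{k} \left(1 - \frac{n_i}{n}\right) s_i^2\right\} \tag{76} \]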


When the finite multiplier is assumed to be unity and $S_i^2$ is
constant, say $S_i^2 = S_w^2$ (i = 1, 2, ..., k), the best unbiased
estimate of the latter is obtained by pooling the sum of squares
within strata for the sample. We write

\[ \text{Est. } S_w^2 = \frac{1}{n-k} \sum_{i=1}^{k} \sum_{j}^{n_i} (y_{ij} - \bar{y}_{n_i})^2 = s_w^2, \text{ say} \tag{77} \]

since

\[ E(s_w^2) = \frac{1}{n-k} E\left\{\sum_{i=1}^{k} \sum_{j}^{n_i} (y_{ij} - \bar{y}_{n_i})^2\right\}
 = \frac{1}{n-k} \sum_{i=1}^{k} E\left\{\sum_{j}^{n_i} (y_{ij} - \bar{y}_{n_i})^2\right\}
 = \frac{1}{n-k} \sum_{i=1}^{k} (n_i - 1) S_i^2
 = S_w^2 \]

Let

\[ \sum_{i=1}^{k} n_i (\bar{y}_{n_i} - \bar{y}_n)^2 = (k - 1)\, \bar{n} s_b^2 \tag{78} \]

where $\bar{n} = n/k$. Substituting from (77) and (78) in (72), we have

\[ \text{Est. } V(\bar{y}_n)_{us} = \frac{1}{n^2} \left\{(n - k + 1)\, s_w^2 + (k - 1)\, \bar{n} s_b^2\right\} \tag{79} \]

The quantities $s_w^2$ and $\bar{n} s_b^2$ are called the mean squares within and
between strata respectively, and are best calculated from what
is familiarly known as the analysis of variance table given below:

Source of Variation    D.F.     Sum of Squares                                  Mean Square

Between Strata ..      k − 1    $\sum_i n_i (\bar{y}_{n_i} - \bar{y}_n)^2$              $\bar{n} s_b^2$
Within Strata  ..      n − k    $\sum_i \sum_j (y_{ij} - \bar{y}_{n_i})^2$              $s_w^2$

Total          ..      n − 1    $\sum_i \sum_j (y_{ij} - \bar{y}_n)^2$                  $s^2$


106 SAMPLING THEORY OF SURVEYS WITH APPLICATIONS 


Also, from (75),

\[ \text{Est. } V(\bar{y}_w) = \frac{s_w^2}{n} \tag{80} \]

An estimate of the reduction in variance is now given by sub-
tracting (80) from (79) or directly from (76), and equals

\[ \text{Est. } V(\bar{y}_n)_{us} - \text{Est. } V(\bar{y}_w) = \frac{k-1}{n^2} \left(\bar{n} s_b^2 - s_w^2\right) \tag{81} \]

The ratio of (81) to (80) gives the relative gain in precision due to
stratification and equals

\[ \frac{k-1}{n} \left(\frac{\bar{n} s_b^2}{s_w^2} - 1\right) \tag{82} \]

The efficiency of stratification is sometimes calculated directly
by comparing the overall mean square $s^2$ with $s_w^2$, the relative
gain in precision being given by

\[ \frac{s^2}{s_w^2} - 1 = \frac{(n-k)\, s_w^2 + (k-1)\, \bar{n} s_b^2}{(n-1)\, s_w^2} - 1
 = \frac{k-1}{n-1} \left(\frac{\bar{n} s_b^2}{s_w^2} - 1\right) \tag{83} \]


The gain in precision estimated this way is $n/(n-1)$ times the value
in (82), which is not likely to be of material difference in large
samples, provided the sample is allocated in proportion to the sizes 
of the different strata. When the sample is not so allocated, neither 
(82) nor (83) is likely to be satisfactory. The exact expression 
given by the ratio of (73) to (57) should be used in that case. 
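
The analysis of variance calculation behind (82) and (83) can be sketched in a few lines of Python. This is only an illustration under the assumptions of this section (proportional allocation, finite multiplier taken as unity, a common within-stratum mean square); the function name `stratification_gain` and the observations are hypothetical and are not taken from the text.

```python
def stratification_gain(samples):
    """Return (s_w2, ms_between, gain_82, gain_83) for a proportionally
    allocated stratified sample, given one list of observations per stratum."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]

    # Sums of squares of the analysis of variance table.
    ss_within = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))

    s_w2 = ss_within / (n - k)          # mean square within strata, as in (77)
    ms_between = ss_between / (k - 1)   # mean square between strata, n-bar * s_b^2 of (78)

    gain_82 = (k - 1) / n * (ms_between / s_w2 - 1)   # relative gain, equation (82)
    s2 = (ss_within + ss_between) / (n - 1)           # overall mean square
    gain_83 = s2 / s_w2 - 1                           # relative gain, equation (83)
    return s_w2, ms_between, gain_82, gain_83


# Hypothetical observations for three equally sized strata.
data = [[10, 12, 11, 13], [20, 22, 21, 19], [30, 29, 31, 32]]
s_w2, ms_between, gain_82, gain_83 = stratification_gain(data)
print(s_w2, ms_between, gain_82, gain_83)
```

For any data set the two measures returned differ exactly by the factor n/(n − 1), which is the relation between (82) and (83) noted above.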


3a.8 Use of Strata Sizes for Improving the Precision of an 
Unstratified Sample 


Stratified sampling presupposes the knowledge of the strata 
sizes as well as the availability of the lists of sampling units for 
the different strata. The latter are not, however, always available. 
Thus, the classification of a population by age is known from the 
census tables although the lists of persons belonging to different 
age groups may not be available for the selection of samples from 
the different age groups. Consequently, it is not possible to know 
in advance to which stratum a sampling unit belongs until it is
contacted in the course of the survey itself. While the sample in such
cases has necessarily to be selected by the method of unstratified 
random sampling, we can always classify the selected sample by the 
strata and treat it as if it were a stratified sample. In this section we 
shall examine the gain in precision arising from such a treatment. 


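If the sample is to be treated as if it were a stratified sample,
then $\bar{y}_w$ would be the appropriate estimate of the population mean.
This is easily seen to be an unbiased estimate of the population
mean, since, for fixed $n_1, n_2, \ldots, n_k$,

\[ E(\bar{y}_w \mid n_1, \ldots, n_k) = \sum_{i=1}^{k} p_i E(\bar{y}_{n_i} \mid n_i) = \sum_{i=1}^{k} p_i \bar{y}_{N_i} = \bar{y}_N \]

Hence

\[ E(\bar{y}_w) = E\{E(\bar{y}_w \mid n_1, \ldots, n_k)\} = \bar{y}_N \tag{84} \]

For fixed $n_1, n_2, \ldots, n_k$, the variance of $\bar{y}_w$ is given by

\[ V(\bar{y}_w) = \sum_{i=1}^{k} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) p_i^2 S_i^2 \tag{85} \]

However, $n_i$ (i = 1, 2, ..., k) varies from sample to sample.
Consequently, (85) is not directly comparable with the variance of
an unstratified sample. We can, however, examine how this method
compares, on an average, with that of an unstratified sample. An
exact expression for the average value of (85) cannot be obtained.
For large values of n, however, and for large N we may use the
result (112) of Section 2a.19 of the last Chapter and write

\[ E\left(\frac{1}{n_i}\right) = \frac{1}{n p_i} + \frac{1 - p_i}{n^2 p_i^2} + O\left(\frac{1}{n^3}\right) \tag{86} \]

Taking expectations in (85) and using (86), we have

\[ E\{V(\bar{y}_w)\} = \sum_{i=1}^{k} \left\{\frac{1}{n p_i} + \frac{1 - p_i}{n^2 p_i^2} + O\left(\frac{1}{n^3}\right) - \frac{1}{N_i}\right\} p_i^2 S_i^2
 = \left(\frac{1}{n} - \frac{1}{N}\right) \sum_{i=1}^{k} p_i S_i^2 + \frac{1}{n^2} \sum_{i=1}^{k} (1 - p_i) S_i^2 + O\left(\frac{1}{n^3}\right) \tag{87} \]

and to the first approximation

\[ E\{V(\bar{y}_w)\} = \frac{N-n}{Nn} \sum_{i=1}^{k} p_i S_i^2 + \frac{1}{n^2} \sum_{i=1}^{k} (1 - p_i) S_i^2 \tag{88} \]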


It is seen that the first term in (88) is the variance of the mean 
of a stratified sample with proportional allocation. We, therefore, 
see that the adjustment of the results of an unstratified random 
sample as if it were stratified gives almost as high precision as 
a stratified sample with proportional allocation, provided the 
sample is large. The result is obvious otherwise also, for, a large 
sample is expected to be distributed in proportion to the sizes of 
the strata. 
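
The two terms of (88) are easily compared numerically. The following Python sketch evaluates them for a set of stratum weights and mean squares; the function name `poststratified_variance` and the numerical values are assumptions made for illustration only, not figures from the text.

```python
def poststratified_variance(n, p, S2, N=None):
    """Return the two terms of (88): the proportional-allocation variance and
    the additional term arising from the random stratum sample sizes n_i.
    When N is None the finite multiplier (N - n)/N is taken as unity."""
    fpc = 1.0 if N is None else (N - n) / N
    proportional = fpc / n * sum(pi * s for pi, s in zip(p, S2))
    extra = sum((1 - pi) * s for pi, s in zip(p, S2)) / n ** 2
    return proportional, extra


# Hypothetical stratum weights and mean squares.
p = [0.5, 0.3, 0.2]
S2 = [4.0, 9.0, 16.0]
proportional, extra = poststratified_variance(100, p, S2)
print(proportional, extra, extra / proportional)
```

With these assumed values the second term is under three per cent of the first for n = 100, and since it is of order 1/n² against 1/n for the first term, the ratio falls further as n grows.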


3a.9 Effect of Increasing the Number of Strata on the Precision 
of the Estimate 


The variance of the estimate of the population mean from 
a stratified sample depends upon 


(i) the strata values of $p_i$ and $S_i$, and
(ii) the sample numbers $n_i$.


We shall assume that $n_i$ is proportional to $p_i$, so that the variance
will now depend only on the strata values of $p_i$ and $S_i$, being
given by

\[ V(\bar{y}_w) = \left(\frac{1}{n} - \frac{1}{N}\right) \sum_{i=1}^{k} p_i S_i^2 \tag{89} \]

The smaller the strata the more alike will presumably be the
sampling units comprising them and the smaller, therefore, will
be the values of $S_i^2$. We may, therefore, expect that under
proportional allocation the precision of the estimate will generally 
increase as the number of strata increases. 


For small departures from proportionality, the effect of increasing 
the number of strata is best studied with the help of (88). The 
first term in this equation, it will be noticed, is identical with 
(89) and will presumably decrease as k increases. On the other 
hand, the contribution of the second term to the variance of $\bar{y}_w$
will increase as k increases. For N large and $S_i^2$ equal to, say
$S_w^2$, (88) may be written as

\[ E\{V(\bar{y}_w)\} = \frac{1}{n} \left\{S_w^2 + \frac{(k-1)\, S_w^2}{n}\right\} \tag{90} \]

Now, $S_w^2$ will ordinarily decrease as k increases, but $(k-1)\, S_w^2$
will increase as k increases, at a rate ordinarily greater than the
rate of decrease in $S_w^2$. We conclude therefore that, for given
n, a stage in the value of k may be reached beyond which strati-
fication may not add to the precision of the estimate.


3a.10 Effects of Inaccuracies in Strata Sizes 


It sometimes happens that the knowledge of strata sizes though 
available is not exact, as, for example, when it is based on old 
census data which are out of date for use in current surveys or 
when derived from a sample. Thus, in a study of farm facts, the
survey may be organized in two stages, first selecting a large sample
for estimating the strata sizes and then a second sample out of the
first for the main purpose of the survey. This latter procedure
is called double sampling (Neyman, 1938). In this section, we
shall examine the effect of inaccuracies in strata sizes on the
estimate of the mean and its variance.


Case I. Strata Sizes Fixed 


Let $p_i$ denote the true but unknown weight of the i-th stratum
and $p_i'$ the inaccurate weight which is known. The sample
estimate of the population mean is then given by

\[ \text{Est. } \bar{y}_N = \sum_{i=1}^{k} p_i' \bar{y}_{n_i} \tag{91} \]

For fixed $p_i'$'s, this is a biased estimate of the population mean,
for, in general

\[ E\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \sum_{i=1}^{k} p_i' \bar{y}_{N_i} \neq \bar{y}_N \tag{92} \]

showing that the estimate is biased by

\[ \sum_{i=1}^{k} (p_i' - p_i)\, \bar{y}_{N_i} \tag{93} \]

To obtain the sampling variance, we write

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = E\left\{\sum_{i=1}^{k} p_i' \bar{y}_{n_i} - E\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right)\right\}^2
 = E\left\{\sum_{i=1}^{k} p_i' (\bar{y}_{n_i} - \bar{y}_{N_i})\right\}^2 + E\left\{\sum_{i=1}^{k} p_i' \bar{y}_{N_i} - E\left(\sum_{i=1}^{k} p_i' \bar{y}_{N_i}\right)\right\}^2 \tag{94} \]

For fixed $p_i'$'s the second term is clearly zero, and, the samples
being drawn independently from the different strata, we are left with

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \sum_{i=1}^{k} p_i'^2\, \frac{N_i - n_i}{N_i n_i}\, S_i^2 \tag{95} \]

The mean square error will be the sum of (95) and the square of
the bias term in (93). We have

\[ \text{M.S.E.} \left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \sum_{i=1}^{k} p_i'^2\, \frac{N_i - n_i}{N_i n_i}\, S_i^2 + \left\{\sum_{i=1}^{k} (p_i' - p_i)\, \bar{y}_{N_i}\right\}^2 \tag{96} \]

If the sample were selected by the method of simple random
sampling without stratification, the mean square error of the
estimate would be identical with its variance, and would be simply

\[ V(\bar{y}_n)_{us} = \frac{N - n}{Nn}\, S^2 \tag{97} \]

As n increases, (97) will decrease and so will the first term in (96).
The bias term is, however, independent of the sample size. It
follows that (96) may assume a larger value than (97) beyond
a certain n, making stratified sampling less accurate than simple
random sampling.


An example will help to illustrate the point. Suppose that according
to an agricultural census taken in an earlier year, 80% of the
holdings were below 5 acres. Information for the current year
is not available but we will assume that the percentage of holdings
below 5 acres has increased to 85. Suppose, further, that
we have selected a stratified sample of n holdings allocated in
proportion to the known sizes of the two strata. Then, clearly,
the sample estimate of the population mean of the character
under study will be calculated from

\[ \bar{y}_w = .80\, \bar{y}_{n_1} + .20\, \bar{y}_{n_2} \tag{98} \]

The true population mean will, however, be

\[ \bar{y}_N = .85\, \bar{y}_{N_1} + .15\, \bar{y}_{N_2} \tag{99} \]

The expected value of the estimate in (98) will be

\[ E(\bar{y}_w) = .80\, \bar{y}_{N_1} + .20\, \bar{y}_{N_2} \tag{100} \]

so that the sample estimate will be biased by

\[ E(\bar{y}_w) - \bar{y}_N = -.05\, \bar{y}_{N_1} + .05\, \bar{y}_{N_2} \tag{101} \]

The mean square error of the sample estimate in (98) will be
composed of two parts, as shown in (96). Assuming that the
population is large so that finite multipliers can be ignored, this
will be given by

\[ \text{M.S.E.} (\bar{y}_w) = \frac{1}{n} \left(.80\, S_1^2 + .20\, S_2^2\right) + \left\{.05\, (\bar{y}_{N_2} - \bar{y}_{N_1})\right\}^2 \tag{102} \]

If we had selected the sample by the unstratified method of
simple random sampling, the sample estimate would be unbiased
and its variance given by

\[ V(\bar{y}_n)_{us} = \frac{S^2}{n} = \frac{1}{n} \left\{.85\, S_1^2 + .15\, S_2^2 + .85\, (\bar{y}_{N_1} - \bar{y}_N)^2 + .15\, (\bar{y}_{N_2} - \bar{y}_N)^2\right\} \tag{103} \]

Now, suppose, the actual values of $S_1^2$, $S_2^2$, $\bar{y}_{N_1}$ and $\bar{y}_{N_2}$ are as
in the following table:

Stratum    $S_i^2$    $\bar{y}_{N_i}$
   1          1          1
   2          1          2

Then,

\[ \text{M.S.E.} (\bar{y}_w) = \frac{1}{n} + .0025 \tag{104} \]

and

\[ V(\bar{y}_n)_{us} = \frac{1.1275}{n} \tag{105} \]

The table below gives the values of (104) and (105) for five dif-
ferent values of n.

   n      M.S.E. $(\bar{y}_w)$    $V(\bar{y}_n)_{us}$
  25         .0425               .04510
  50         .0225               .02255
 100         .0125               .01128
 200         .0075               .00564
 400         .0050               .00282


It will be seen that for small n, the actual mean square error
of the stratified sample is smaller than that of the unstratified
simple random sample, but the superiority is lost after n = 51.
With a larger size of sample, the bias assumes still larger
proportions. It must, however, be pointed out that the bias will
not be known in practice and consequently the variance of the
mean of a stratified sample will continue to be estimated by the
first term in (96), thereby under-estimating the variance.
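
The figures of this example can be reproduced with a short Python sketch of (102) and (103). The stratum weights, mean squares and means are those of the example; the function and variable names are merely illustrative.

```python
# Known (outdated) and true stratum weights from the example.
p_known = [0.80, 0.20]
p_true = [0.85, 0.15]
S2 = [1.0, 1.0]      # stratum mean squares
means = [1.0, 2.0]   # stratum means


def mse_stratified(n):
    """Equation (102): variance under allocation proportional to the known
    weights plus the squared bias, finite multipliers ignored."""
    variance = sum(p * s for p, s in zip(p_known, S2)) / n
    bias = sum((pk - pt) * m for pk, pt, m in zip(p_known, p_true, means))
    return variance + bias ** 2


def var_unstratified(n):
    """Equation (103): S^2 / n for simple random sampling without stratification."""
    pop_mean = sum(p * m for p, m in zip(p_true, means))
    S2_total = sum(p * s + p * (m - pop_mean) ** 2
                   for p, s, m in zip(p_true, S2, means))
    return S2_total / n


for n in (25, 50, 100, 200, 400):
    print(n, round(mse_stratified(n), 4), round(var_unstratified(n), 5))
```

The two expressions become equal where .1275/n = .0025, that is at n = 51, in agreement with the statement above.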


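Case II. Double Sampling


We shall now consider the case of double sampling where $p_i$'s
are estimated from a preliminary simple random sample and the
character under study is observed on a sub-sample selected from
the preliminary sample. Let Q denote the size of the preliminary
sample, $Q_i$ the number of units in the i-th stratum and $n_i$ the size
of the sub-sample chosen out of $Q_i$ (i = 1, 2, ..., k). We have

\[ \text{Est. } p_i = \frac{Q_i}{Q} = p_i' \]

and

\[ E(p_i') = p_i \]

The estimate $\sum p_i' \bar{y}_{n_i}$ is now clearly the unbiased estimate of
$\bar{y}_N$. For, from (92), we have

\[ E\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = E\left\{E\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i} \,\Big|\, p_i'\right)\right\}
 = E\left(\sum_{i=1}^{k} p_i' \bar{y}_{N_i}\right) = \sum_{i=1}^{k} p_i \bar{y}_{N_i} = \bar{y}_N \tag{106} \]

To obtain the variance of the estimate, all that we need do is to
take the expected value of the terms on the right-hand side
of (94) for variation in $p_i'$'s in repeated samples of Q. We write

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = E\left\{\sum_{i=1}^{k} p_i'^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2\right\} + E\left\{\sum_{i=1}^{k} (p_i' - p_i)\, \bar{y}_{N_i}\right\}^2 \tag{107} \]

On expanding the second term on the right-hand side, we have

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \sum_{i=1}^{k} E(p_i'^2) \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2 + \sum_{i=1}^{k} \bar{y}_{N_i}^2\, E(p_i' - p_i)^2
 + \sum_{i \neq i'} \bar{y}_{N_i} \bar{y}_{N_{i'}}\, E(p_i' - p_i)(p_{i'}' - p_{i'}) \tag{108} \]

Now from (71), (78) and (98) of Chapter II, we have

\[ E(p_i'^2) = p_i^2 + \frac{N-Q}{N-1}\, \frac{p_i (1 - p_i)}{Q} \tag{109} \]

\[ E(p_i' - p_i)^2 = \frac{N-Q}{N-1}\, \frac{p_i (1 - p_i)}{Q} \tag{110} \]

and

\[ E(p_i' - p_i)(p_{i'}' - p_{i'}) = -\frac{N-Q}{N-1}\, \frac{p_i p_{i'}}{Q} \tag{111} \]

On substituting from (109), (110) and (111) in (108), we have

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \sum_{i=1}^{k} \left\{p_i^2 + \frac{N-Q}{N-1}\, \frac{p_i (1 - p_i)}{Q}\right\} \frac{N_i - n_i}{N_i n_i}\, S_i^2
 + \frac{N-Q}{N-1}\, \frac{1}{Q} \left\{\sum_{i=1}^{k} p_i (1 - p_i)\, \bar{y}_{N_i}^2 - \sum_{i \neq i'} p_i p_{i'}\, \bar{y}_{N_i} \bar{y}_{N_{i'}}\right\} \tag{112} \]

\[ = \sum_{i=1}^{k} \left\{p_i^2 + \frac{N-Q}{N-1}\, \frac{p_i (1 - p_i)}{Q}\right\} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2
 + \frac{N-Q}{N-1}\, \frac{1}{Q} \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2 \tag{113} \]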

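The term involving the differences between the strata means on
the right-hand side of (113) can also be written directly from
(107). For, the second term in (107) clearly represents the variance
of the mean of a simple random sample of size Q drawn from
a population of size N divided into k classes with all the $N_i$ values
in the i-th class equal to $\bar{y}_{N_i}$ each (i = 1, 2, ..., k), so that

\[ E\left\{\sum_{i=1}^{k} (p_i' - p_i)\, \bar{y}_{N_i}\right\}^2 = \frac{N-Q}{QN}\, \frac{1}{N-1} \sum_{i=1}^{k} N_i (\bar{y}_{N_i} - \bar{y}_N)^2
 = \frac{N-Q}{N-1}\, \frac{1}{Q} \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2 \]

Now, (113) can be rewritten as

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \sum_{i=1}^{k} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) p_i^2 S_i^2
 + \frac{N-Q}{N-1}\, \frac{1}{Q} \left[\sum_{i=1}^{k} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) p_i (1 - p_i)\, S_i^2 + \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2\right] \tag{114} \]

The first term on the right-hand side is clearly the variance of
the mean of a stratified sample when the strata sizes are known.
The effect of determining the strata sizes from a sample is thus
to increase the variance of the estimate by

\[ \frac{N-Q}{N-1}\, \frac{1}{Q} \left[\sum_{i=1}^{k} \left(\frac{1}{n_i} - \frac{1}{N_i}\right) p_i (1 - p_i)\, S_i^2 + \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2\right] \tag{115} \]

When finite multipliers can be ignored, the effect is to increase
the variance by approximately

\[ \frac{\sigma_b^2}{Q} \tag{116} \]

where

\[ \sigma_b^2 = \sum_{i=1}^{k} p_i (\bar{y}_{N_i} - \bar{y}_N)^2 \tag{117} \]

since the first term in (115) will be small relative to the second.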
Compared to simple random sampling, on the other hand, the
procedure will lead to gain in precision. Thus, for the case when
$n_i$ is proportional to $p_i S_i$, and ignoring finite multipliers, (114)
reduces to

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \frac{1}{n} \left(\sum_{i=1}^{k} p_i S_i\right)^2 + \frac{1}{nQ} \left(\sum_{i=1}^{k} p_i S_i\right) \left(\sum_{i=1}^{k} (1 - p_i)\, S_i\right) + \frac{\sigma_b^2}{Q}
 \simeq \frac{1}{n} \left(\sum_{i=1}^{k} p_i S_i\right)^2 + \frac{\sigma_b^2}{Q} \tag{118} \]

If the sample were chosen by the procedure of simple random
sampling without stratification, the variance of the estimate
would be

\[ V(\bar{y}_n)_{us} = \frac{S^2}{n} = \frac{\sum_{i=1}^{k} p_i S_i^2 + \sigma_b^2}{n} \tag{119} \]

The reduction in variance is, therefore, given by

\[ V_{us} - V_{ds} = \frac{1}{n} \sum_{i=1}^{k} p_i (S_i - \bar{S}_w)^2 + \left(\frac{1}{n} - \frac{1}{Q}\right) \sigma_b^2 \tag{120} \]

where the letters 'ds' stand for double sampling and, as before,
$\bar{S}_w = \sum_{i=1}^{k} p_i S_i$. If Q is large relative to n, the reduction in
variance will approximate to the difference between the variances
of an unstratified simple random sample and that of a stratified
sample with Neyman allocation (vide Section 3a.5). In other
words, when Q is large, the procedure of double sampling will
be approximately equivalent to stratified sampling when strata
sizes are accurately known.

This comparison of the precision of single with double sampling
regardless of the cost of the survey is not, however, of practical
value. What is of interest is to know whether the reduction in
variance would be worth the extra expenditure on the preliminary 
sample. Alternatively, we can consider the problem as one of 
choosing, for a fixed cost, say $C_0$, between two procedures of
sampling, namely, (i) a single sample drawn by the procedure of 
simple random sampling without stratification, and (ii) double 
sampling. The problem clearly envisages three steps, namely, 
(a) determining the optimum design for each of the two procedures 
of sampling, (b) obtaining the variances of the estimates for the 
optimum designs, and (c) comparison of the two variances. 


We shall consider the simplest case in which the cost of each 
sample is proportional to its size so that the total cost of the 
survey, using the procedure of double sampling, is represented by 


\[ c_1 Q + c_2 \sum_{i=1}^{k} n_i \tag{121} \]

and that of a single sample of size, say $n'$, drawn by procedure
(i) by

\[ c_2 n' \tag{122} \]

where $c_1$ is the cost per unit of the preliminary sample and $c_2$
the cost per unit for the main sample. Obviously $c_1$ will be
smaller than $c_2$, for unless this were so, a preliminary sample would
be ruled out altogether. For procedure (i) the optimum design
is clearly the one for which the size of the single sample is
given by

\[ n' = \frac{C_0}{c_2} \tag{123} \]

The variance of the estimate based on a single sample is, therefore,
given by

\[ V(\bar{y}_{n'})_{us} = \frac{S^2}{n'} = \frac{S^2 c_2}{C_0} \tag{124} \]
The variance of the estimate based on the procedure of double
sampling given in (114) depends on Q and $n_i$ (i = 1, 2, ..., k).
The optimum values of Q and $n_i$ are those for which this variance
is minimum for given $C_0$. Owing to the complexity of the
algebra, we shall attempt only an approximate solution by ignoring
all finite multipliers and assuming that

\[ \frac{p_i (1 - p_i)}{Q} \]

is negligible in comparison with $p_i^2$. Hence (114) reduces to

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) \simeq \sum_{i=1}^{k} \frac{p_i^2 S_i^2}{n_i} + \frac{\sigma_b^2}{Q} \tag{125} \]

This is to be minimized subject to the condition

\[ C_0 = c_1 Q + c_2 \sum_{i=1}^{k} n_i \tag{126} \]

Using the method given in the previous sections, we form the
function $\phi$ given by

\[ \phi = \sum_{i=1}^{k} \frac{p_i^2 S_i^2}{n_i} + \frac{\sigma_b^2}{Q} + \mu \left(c_1 Q + c_2 \sum_{i=1}^{k} n_i\right) \tag{127} \]

and note that the right-hand side can be written as

\[ \phi = \sum_{i=1}^{k} \left(\frac{p_i S_i}{\sqrt{n_i}} - \sqrt{\mu c_2 n_i}\right)^2 + \left(\frac{\sigma_b}{\sqrt{Q}} - \sqrt{\mu c_1 Q}\right)^2
 + 2 \sqrt{\mu} \left(\sqrt{c_2} \sum_{i=1}^{k} p_i S_i + \sqrt{c_1}\, \sigma_b\right) \tag{128} \]

so that for optimum values

\[ n_i = \frac{p_i S_i}{\sqrt{\mu c_2}} \quad (i = 1, 2, \ldots, k) \tag{129} \]

and

\[ Q = \frac{\sigma_b}{\sqrt{\mu c_1}} \tag{130} \]
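The value of $\mu$ will depend upon whether the variance is
minimized for fixed cost or the cost is minimized for fixed variance.
In the former case, which is the one under consideration here,
we obtain from (126), (129) and (130)

\[ C_0 = \frac{1}{\sqrt{\mu}} \left(\sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i\right) \]

giving us

\[ \frac{1}{\sqrt{\mu}} = \frac{C_0}{\sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i} \tag{131} \]

Hence

\[ n_i = \frac{p_i S_i\, C_0}{\sqrt{c_2} \left(\sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i\right)} \tag{132} \]

and

\[ Q = \frac{\sigma_b\, C_0}{\sqrt{c_1} \left(\sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i\right)} \tag{133} \]

Substituting for Q and $n_i$ in the expression for the variance given
in (125), we have

\[ V\left(\sum_{i=1}^{k} p_i' \bar{y}_{n_i}\right) = \frac{\left(\sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i\right)^2}{C_0} \tag{134} \]

Now, for a fixed cost, double sampling would lead to higher
precision if (134) is less than (124), that is, if

\[ \frac{\left(\sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i\right)^2}{C_0} < \frac{S^2 c_2}{C_0} \]

i.e., if

\[ \sqrt{c_1}\, \sigma_b + \sqrt{c_2} \sum_{i=1}^{k} p_i S_i < S \sqrt{c_2} \]

i.e., if

\[ \frac{c_1}{c_2} < \left(\frac{S - \sum_{i=1}^{k} p_i S_i}{\sigma_b}\right)^2 \tag{135} \]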
where $(S - \sum p_i S_i)/\sigma_b$ will usually be a positive proper fraction
for efficient systems of stratification.
Taking, for illustration, the case when

\[ S_i = S_w = 1 \quad (i = 1, 2, \ldots, k), \qquad \sigma_b^2 = 3 \]

so that

\[ S = 2 \]

equation (135) reduces to

\[ c_1 < \frac{c_2}{3} \]

In other words, if the cost per unit of surveying the preliminary 
sample is less than one-third the cost per unit of surveying 
the main sample, double sampling would be a more efficient 


procedure to adopt than simple random sampling without strati- 
fication. 
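
The optimum design of (132) and (133) and the comparison of (134) with (124) may be sketched in Python as follows. This is an illustration only, resting on the approximations made in deriving (125); the function names, the stratum weights and the cost figures are hypothetical, and the values S_i = 1, σ_b² = 3 (hence S² = 4) are chosen so that the break-even cost ratio in (135) is one-third.

```python
from math import sqrt


def double_sampling_design(C0, c1, c2, p, S, sigma_b):
    """Approximate optimum n_i (132), Q (133) and the resulting variance (134)
    for total cost C0, unit costs c1 (preliminary) and c2 (main sample)."""
    sw_bar = sum(pi * si for pi, si in zip(p, S))      # sum of p_i S_i
    denom = sqrt(c1) * sigma_b + sqrt(c2) * sw_bar
    n_i = [pi * si * C0 / (sqrt(c2) * denom) for pi, si in zip(p, S)]
    Q = sigma_b * C0 / (sqrt(c1) * denom)
    variance = denom ** 2 / C0
    return n_i, Q, variance


def single_sample_variance(C0, c2, S2):
    """Variance (124) of a single unstratified sample of size C0 / c2."""
    return S2 * c2 / C0


# Hypothetical two-stratum population with S_i = 1 and sigma_b^2 = 3, so S^2 = 4.
p = [0.3, 0.7]
S = [1.0, 1.0]
sigma_b = sqrt(3.0)
C0, c2 = 100.0, 1.0

v_single = single_sample_variance(C0, c2, 4.0)
for c1 in (0.1, 1.0 / 3.0, 0.5):
    n_i, Q, v_double = double_sampling_design(C0, c1, c2, p, S, sigma_b)
    print(c1, [round(x, 1) for x in n_i], round(Q, 1), v_double, v_single)
```

Under these assumptions double sampling gives the smaller variance only while the ratio c1/c2 is below one-third, and the designs returned always exhaust the fixed cost C0.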


Example 3.1 


Table 3.1 presents the summary of data for complete census 
of all the 340 villages in Ghaziabad Subdivision. The villages 
were stratified by size of their agricultural area into four strata 
as shown in col. 2 of Table 3.1. The numbers of villages in the 
different strata are given in col. 3. The population values of the 
strata means for the area under wheat ($\bar{y}_{N_i}$) and those of the
standard deviations for the area under wheat ($S_{y_i}$) and for the
agricultural area ($S_{a_i}$) are given in the subsequent columns.


Calculate the sampling variance of the estimated area under 
wheat for a sample of 34 villages: 


(1) if the villages are selected by the method of simple random
sampling without stratification;


(2) if the villages are selected by the method of simple random
sampling within each stratum, and allocated in proportion
to (i) the sizes of the strata ($N_i$), (ii) the products $N_i S_{y_i}$,
and (iii) the products $N_i S_{a_i}$.

TABLE 3.1


Strata Means and Standard Deviations of Areas for Villages
in Ghaziabad Subdivision


Stratum    Size of Village    $N_i$    $\bar{y}_{N_i}$    $S_{y_i}$    $S_{a_i}$
Number     in Bighas*
  (1)           (2)            (3)       (4)        (5)       (6)
   1          0- 500            63      112.1       56.3     129.6
   2        501-1500           199      276.7      116.4     267.0
   3       1501-2500            53      558.1      186.0     276.1
   4          >2500             25      960.1      361.3     982.2

* | Bigha — $ acre. 
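1. Simple Random Sampling Without Stratification
We have from (29)

\[ S^2 = \frac{1}{N-1} \left\{\sum_{i=1}^{k} (N_i - 1)\, S_{y_i}^2 + \sum_{i=1}^{k} N_i (\bar{y}_{N_i} - \bar{y}_N)^2\right\}
 = \frac{1}{N-1} \left\{\sum_{i=1}^{k} N_i S_{y_i}^2 - \sum_{i=1}^{k} S_{y_i}^2 + \sum_{i=1}^{k} N_i \bar{y}_{N_i}^2 - \frac{\left(\sum_{i=1}^{k} N_i \bar{y}_{N_i}\right)^2}{N}\right\} \]

The relevant calculations are shown in Table 3.2. On substituting,
we obtain

\[ S^2 = \frac{1}{339} \left[7994000 - 182000 + 55577000 - 39372000\right] = 70850 \]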


TABLE 3.2

Calculation of the Sampling Variance in Example 3.1

Stratum  $N_i$  $S_{y_i}$  $S_{y_i}^2$  $N_i S_{y_i}^2$  $\bar{y}_{N_i}$  $N_i \bar{y}_{N_i}$  $N_i \bar{y}_{N_i}^2$  $N_i S_{y_i}$  $S_{a_i}$  $N_i S_{a_i}$  $N_i S_{y_i}^2 / S_{a_i}$
Number
  (1)     (2)    (3)      (4)       (5)      (6)      (7)        (8)       (9)    (10)    (11)     (12)

   1       63    56.3     3170    200000    112.1     7060     791000     3550   129.6    8160     1540
   2      199   116.4    13550   2696000    276.7    55060   15235000    23160   267.0   53130    10100
   3       53   186.0    34600   1834000    558.1    29580   16509000     9860   276.1   14630     6640
   4       25   361.3   130540   3264000    960.1    24000   23042000     9030   982.2   24560     3320

 Total    340           181860   7994000            115700   55577000    45600          100480    21600
Hence

\[ V(\bar{y}_n)_{us} = \frac{N-n}{N} \cdot \frac{1}{n}\, S^2 = \frac{306}{340 \times 34} \times 70850 = 1875 \]

2. (i) Proportional Allocation
We have

\[ V(\bar{y}_w)_P = \frac{N-n}{N^2 n} \sum_{i=1}^{k} N_i S_{y_i}^2 = \frac{306}{(340)^2 \times 34} \times 7994000 = 622 \]


2. (ii) Neyman Allocation


The allocation of the sample to the different strata will be in
proportion to $N_i S_{y_i}$ shown in col. 9 of Table 3.2. On substituting
in (22), we get

\[ V(\bar{y}_w)_N = \frac{1}{N^2 n} \left(\sum_{i=1}^{k} N_i S_{y_i}\right)^2 - \frac{1}{N^2} \sum_{i=1}^{k} N_i S_{y_i}^2
 = \frac{1}{(340)^2} \left[\frac{(45600)^2}{34} - 7994000\right]
 = \frac{1}{115600} \left[61158000 - 7994000\right]
 = 460 \]
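2. (iii) Allocation Proportional to $N_i S_{a_i}$


The allocation of the sample will be proportional to $N_i S_{a_i}$
given in col. 11 of Table 3.2. On substituting

\[ n_i = \frac{n N_i S_{a_i}}{\sum_{i=1}^{k} N_i S_{a_i}} \]

in the formula for the variance of the mean of a stratified sample
given in (7), we get

\[ V(\bar{y}_w) = \frac{1}{N^2} \left[\frac{1}{n} \left(\sum_{i=1}^{k} N_i S_{a_i}\right) \left(\sum_{i=1}^{k} \frac{N_i S_{y_i}^2}{S_{a_i}}\right) - \sum_{i=1}^{k} N_i S_{y_i}^2\right]
 = \frac{1}{(340)^2} \left[\frac{1}{34} \times (100480)(21600) - 7994000\right]
 = \frac{1}{115600} \times 55840000
 = 483 \]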


Now the relative efficiency of any given procedure of sampling 
(B) compared to that of a standard procedure (A) for the same size 


of sample is defined as the ratio of the inverse of their variances.
Thus

\[ \text{R.E. of B} = \frac{1/V_B}{1/V_A} = \frac{V_A}{V_B} \]

The values of the variance obtained above for the different 
procedures of sampling together with those of their relative effici- 
encies as compared with unstratified simple random sampling, are 
given in Table 3.3. The values of the relative efficiencies com- 
pared to proportional sampling are also shown in the table. 


TABLE 3.3


Relative Efficiencies of Different Methods
of Stratified Sampling


                                              R.E. compared     R.E. compared
Method of Sampling               Variance     to Unstratified   to Proportional
                                              Sampling          Sampling

Unstratified Simple Random
Sampling                  ..       1875

Stratified:
  (i) Proportional        ..        622          301%
 (ii) Neyman              ..        460          408%              135%
(iii) Proportional to
      $N_i S_{a_i}$        ..        483          388%              129%
It will be seen that stratified sampling reduces the sampling
variance to nearly one-third its value in unstratified simple random 
sampling. Further, Neyman allocation improves the precision 
as compared to proportional allocation. The allocation of the 
sample in accordance with the Neyman principle as applied to a 
correlated character is seen to be almost as effective in improving 
the precision as the Neyman method applied to the character 
under study. 
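
The calculations of Example 3.1 can be checked with the following Python sketch, which works from the unrounded entries of Table 3.1 and treats the allocated stratum sample sizes as fractional, as the hand calculation does. The function and variable names are illustrative only.

```python
# Table 3.1: stratum sizes, means and standard deviations.
N_i = [63, 199, 53, 25]
ybar = [112.1, 276.7, 558.1, 960.1]     # strata means, area under wheat
S_y = [56.3, 116.4, 186.0, 361.3]       # standard deviations, area under wheat
S_a = [129.6, 267.0, 276.1, 982.2]      # standard deviations, agricultural area
n = 34
N = sum(N_i)


def stratified_variance(weights):
    """Variance of the stratified mean when n is allocated in proportion to
    `weights`, using sum (1/n_i - 1/N_i) p_i^2 S_i^2 with fractional n_i."""
    total = sum(weights)
    n_alloc = [n * w / total for w in weights]
    return sum((1 / ni - 1 / Ni) * (Ni / N) ** 2 * s ** 2
               for ni, Ni, s in zip(n_alloc, N_i, S_y))


# Population mean square from the strata values, as in (29).
pop_mean = sum(Ni * y for Ni, y in zip(N_i, ybar)) / N
S2 = (sum((Ni - 1) * s ** 2 for Ni, s in zip(N_i, S_y))
      + sum(Ni * (y - pop_mean) ** 2 for Ni, y in zip(N_i, ybar))) / (N - 1)

v_us = (N - n) / (N * n) * S2                                       # unstratified
v_prop = stratified_variance(N_i)                                   # proportional
v_neyman = stratified_variance([Ni * s for Ni, s in zip(N_i, S_y)]) # Neyman
v_area = stratified_variance([Ni * a for Ni, a in zip(N_i, S_a)])   # prop. to N_i S_ai

for name, v in [("unstratified", v_us), ("proportional", v_prop),
                ("Neyman", v_neyman), ("N_i S_ai", v_area)]:
    print(name, round(v), round(100 * v_us / v), round(100 * v_prop / v))
```

Small differences from the figures in the text can arise because Table 3.2 carries rounded intermediate values.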


Example 3.2 


A yield survey on paddy was carried out in Kegalle District 
(Ceylon) in Maha 1951-52 (Koshal, 1953). Twenty-eight villages 
were selected, distributed in the various strata approximately in 
proportion to the acreage under paddy. Three plots of 1/80 acre 
were harvested in each village. The values of the means and the 
mean squares of the village means for the different strata are given 
in Table 3.4. Obtain the estimate of the district mean yield by 
combining the strata means in proportion to the number of 
villages in the strata. Calculate its variance and hence estimate 
the efficiency of stratification as compared to unstratified simple 
random sampling, treating the village means as the true means 
of the respective villages. 


TABLE 3.4


Crop-Cutting Experiments on Paddy, Kegalle District (Ceylon),
Maha 1951-52


Means and Mean Squares of Village Mean Yields per Plot


Stratum    $N_i$    $n_i$    $\bar{y}_{n_i}$ (oz./plot)    $s_i^2$ (oz./plot)$^2$
   1        189      5           369                    4330.9
   2        242      7           301                   14812.4
   3        146      3           368                   17309.0
   4        178      3           171                    1658.5
   5        287     10           305                    3452.7
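The relevant calculations are given in Table 3.5. From col. 5
we have

\[ \text{Est. } \bar{y}_N = \bar{y}_w = 301.6 \]

To obtain the variance of $\bar{y}_w$, we substitute in (57) and obtain

\[ \text{Est. } V(\bar{y}_w) = \sum_{i=1}^{k} \frac{p_i^2 s_i^2}{n_i} - \sum_{i=1}^{k} \frac{p_i^2 s_i^2}{N_i}
 = 298.23 - 7.57
 = 290.66 \]

From (72), taking N/(N − 1) = 1, we obtain

\[ \text{Est. } V(\bar{y}_n)_{us} = \frac{N-n}{Nn} \left\{\sum_{i=1}^{k} p_i s_i^2 + \sum_{i=1}^{k} p_i \bar{y}_{n_i}^2 - \left(\sum_{i=1}^{k} p_i \bar{y}_{n_i}\right)^2 - \sum_{i=1}^{k} \frac{p_i s_i^2}{n_i} + \sum_{i=1}^{k} \frac{p_i^2 s_i^2}{n_i}\right\}
 = \left(\frac{1}{28} - \frac{1}{1042}\right) \left(7885 + 95300 - 90960 - 1646 + 298\right)
 = (0.034755)(10900)
 = 379 \]

Hence

\[ \text{Efficiency of stratification} = \frac{\text{Est. } V(\bar{y}_n)_{us}}{\text{Est. } V(\bar{y}_w)} = \frac{379}{291} = 1.30 \text{ or } 130\% \]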
TABLE 3.5
Crop-Cutting Experiments on Paddy, Kegalle District,
Maha 1951-52


Calculation of the District Mean Yield and its Variance


Stratum   $N_i$   $n_i$   $\bar{y}_{n_i}$    $p_i$      $p_i \bar{y}_{n_i}$    $p_i \bar{y}_{n_i}^2$     $s_i^2$
Number     (1)    (2)     (3)      (4)     (5)=(3)×(4)   (6)=(3)×(5)     (7)

   1       189     5      369    .181382      66.9         24700        4330.9
   2       242     7      301    .232246      69.9         21000       14812.4
   3       146     3      368    .140115      51.6         19000       17309.0
   4       178     3      171    .170825      29.2          5000        1658.5
   5       287    10      305    .275432      84.0         25600        3452.7

 Total    1042    28                         301.6         95300


Stratum   $p_i s_i^2$    $p_i^2 s_i^2$    $p_i s_i^2 / n_i$    $p_i^2 s_i^2 / n_i$    $p_i^2 s_i^2 / N_i$
Number      (8)         (9)       (10)=(8)÷(2)    (11)=(9)÷(2)    (12)=(9)÷(1)

   1       785.55      142.48        157.1           28.50           0.754
   2      3440.12      798.95        491.4          114.14           3.301
   3      2425.25      339.81        808.4          113.27           2.327
   4       283.31       48.40         94.4           16.13           0.272
   5       950.98      261.93         95.1           26.19           0.913

 Total    7885.21                   1646.4          298.23           7.567
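
The arithmetic of Example 3.2 can be verified with a short Python sketch that applies (57) and the estimate of the unstratified variance with N/(N − 1) taken as unity directly to the data of Table 3.4. The variable names are illustrative only.

```python
# Table 3.4: strata sizes, sample sizes, sample means and mean squares.
N_i = [189, 242, 146, 178, 287]
n_i = [5, 7, 3, 3, 10]
ybar_i = [369, 301, 368, 171, 305]
s2_i = [4330.9, 14812.4, 17309.0, 1658.5, 3452.7]

N = sum(N_i)
n = sum(n_i)
p = [Ni / N for Ni in N_i]

# Weighted mean: estimate of the district mean yield.
ybar_w = sum(pi * y for pi, y in zip(p, ybar_i))

# Equation (57): estimated variance of the stratified mean.
v_strat = sum((1 / ni - 1 / Ni) * pi ** 2 * s2
              for ni, Ni, pi, s2 in zip(n_i, N_i, p, s2_i))

# Estimate of S^2 from the stratified sample, taking N/(N - 1) = 1.
est_S2 = (sum(pi * s2 for pi, s2 in zip(p, s2_i))
          + sum(pi * (y - ybar_w) ** 2 for pi, y in zip(p, ybar_i))
          - sum(pi * (1 - pi) * s2 / ni for pi, s2, ni in zip(p, s2_i, n_i)))
v_us = (N - n) / (N * n) * est_S2

efficiency = v_us / v_strat
print(round(ybar_w, 1), round(v_strat, 2), round(v_us), round(efficiency, 2))
```

The sketch uses the identity Σ p_i ȳ²_{n_i} − (Σ p_i ȳ_{n_i})² = Σ p_i (ȳ_{n_i} − ȳ_w)², which avoids subtracting two large rounded totals.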


B. VARYING PROBABILITIES OF SELECTION


3b.1 Estimate of the Population Mean and its Sampling Variance

Lastly we shall consider the theory of stratified sampling when
sampling units within a stratum are selected with replacement
with varying probabilities of selection. Let

\[ P_{ij} \quad (j = 1, 2, \ldots, N_i;\ i = 1, 2, \ldots, k) \]

be the selection probability assigned to the j-th unit in the
i-th stratum. Clearly, then, in virtue of the results in Section
2b.2, $\bar{z}_{n_i}$ defined by

\[ \bar{z}_{n_i} = \frac{1}{n_i} \sum_{j}^{n_i} z_{ij} \tag{136} \]

will provide an unbiased estimate of the population mean for the
i-th stratum, namely, $\bar{y}_{N_i}$, and its sampling variance will be
given by

\[ V(\bar{z}_{n_i}) = \frac{\sigma_{iz}^2}{n_i} \tag{137} \]

where

\[ z_{ij} = \frac{y_{ij}}{N_i P_{ij}} \tag{138} \]

\[ \sigma_{iz}^2 = \sum_{j}^{N_i} P_{ij} (z_{ij} - \bar{z}_{i.})^2 \tag{139} \]

\[ \bar{z}_{i.} = \sum_{j}^{N_i} P_{ij} z_{ij} = \bar{y}_{N_i} \tag{140} \]


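An estimate of the population mean $\bar{y}_N$ will be the weighted
mean $\bar{z}_w$ given by

\[ \bar{z}_w = \frac{1}{N} \sum_{i=1}^{k} N_i \bar{z}_{n_i} = \sum_{i=1}^{k} p_i \bar{z}_{n_i} \tag{141} \]

which is easily seen to provide an unbiased estimate of $\bar{y}_N$. For,

\[ E(\bar{z}_w) = \sum_{i=1}^{k} p_i E(\bar{z}_{n_i})
 = \sum_{i=1}^{k} p_i\, \frac{1}{n_i} \sum_{j}^{n_i} E(z_{ij})
 = \sum_{i=1}^{k} p_i \bar{y}_{N_i}
 = \bar{y}_N \tag{142} \]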
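To obtain the sampling variance of $\bar{z}_w$, we have

\[ V(\bar{z}_w) = E\{\bar{z}_w - E(\bar{z}_w)\}^2
 = E\left\{\sum_{i=1}^{k} p_i (\bar{z}_{n_i} - \bar{z}_{i.})\right\}^2
 = E\left\{\sum_{i=1}^{k} p_i^2 (\bar{z}_{n_i} - \bar{z}_{i.})^2 + \sum_{i \neq i'} p_i p_{i'} (\bar{z}_{n_i} - \bar{z}_{i.})(\bar{z}_{n_{i'}} - \bar{z}_{i'.})\right\}
 = \sum_{i=1}^{k} p_i^2\, E(\bar{z}_{n_i} - \bar{z}_{i.})^2 + \sum_{i \neq i'} p_i p_{i'}\, E(\bar{z}_{n_i} - \bar{z}_{i.})\, E(\bar{z}_{n_{i'}} - \bar{z}_{i'.}) \]

since samples are drawn independently from the i-th and the i'-th
strata. Hence,

\[ V(\bar{z}_w) = \sum_{i=1}^{k} \frac{p_i^2 \sigma_{iz}^2}{n_i} \tag{143} \]

in virtue of (137).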


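Using Section 2b.3, an estimate of $V(\bar{z}_w)$ will be provided by

\[ \text{Est. } V(\bar{z}_w) = \sum_{i=1}^{k} \frac{p_i^2 s_{iz}^2}{n_i} \tag{144} \]

where $s_{iz}^2$ denotes the mean square between z's in the sample for
the i-th stratum, and is defined by

\[ s_{iz}^2 = \frac{1}{n_i - 1} \sum_{j}^{n_i} (z_{ij} - \bar{z}_{n_i})^2 \]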


3b.2 Allocation of Sample among Different Strata

The variance of the estimate, apart from the population con-
stants $p_i$ and $\sigma_{iz}$, is seen to depend upon the allocation of the
sample among the different strata. The cost of the survey will
likewise depend upon the values of $n_i$. The principle for deter-
mining the optimum values of $n_i$, as stated in Section 3a.3, is to
maximize the precision for given cost or minimize the cost for
given precision. We shall illustrate the principle for the simple
case for which the cost of the survey is represented by

\[ C = \sum_{i=1}^{k} c_i n_i \tag{145} \]
ісі 


where с; is the cost of collecting the information per unit in the 
i-th stratum. 


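Let

\[ \phi = V(\bar{z}_w) + \mu C
 = \sum_{i=1}^{k} \frac{p_i^2 \sigma_{iz}^2}{n_i} + \mu \sum_{i=1}^{k} c_i n_i \]

where $\mu$ is a constant.


Clearly $V(\bar{z}_w)$ is minimum for fixed cost, say $C_0$, or C is
minimum for fixed variance, say $V_0$, when $\phi$ is a minimum. Now
$\phi$ can be written as

\[ \phi = \sum_{i=1}^{k} \left(\frac{p_i \sigma_{iz}}{\sqrt{n_i}} - \sqrt{\mu c_i n_i}\right)^2 + 2 \sqrt{\mu} \sum_{i=1}^{k} p_i \sigma_{iz} \sqrt{c_i} \tag{146} \]

It follows that $\phi$ is minimum when each of the square terms on the
right-hand side of (146) is zero, or in other words

\[ n_i = \frac{1}{\sqrt{\mu}}\, \frac{p_i \sigma_{iz}}{\sqrt{c_i}} \quad (i = 1, 2, \ldots, k) \tag{147} \]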


The constant of proportionality $1/\sqrt{\mu}$ is determined so as to
satisfy the condition of fixed cost or fixed variance. In the
former case, we substitute for $n_i$ from (147) in (145) and obtain

\[ \frac{1}{\sqrt{\mu}} = \frac{C_0}{\sum_{i=1}^{k} p_i \sigma_{iz} \sqrt{c_i}} \tag{148} \]

Hence, from (147), we get

\[ n_i = \frac{p_i \sigma_{iz}}{\sqrt{c_i}} \cdot \frac{C_0}{\sum_{i=1}^{k} p_i \sigma_{iz} \sqrt{c_i}} \tag{149} \]

which is seen to be identical in form with (13) in Section 3a.3.

When $c_i = c$, or, in other words, when the total size of sample
is fixed, the optimum allocation is given by

\[ n_i = n\, \frac{p_i \sigma_{iz}}{\sum_{i=1}^{k} p_i \sigma_{iz}} \tag{150} \]

For the alternative approach in which the cost is minimized
for given precision, the reader may verify that the value of $n_i$ is
given by

\[ n_i = \frac{p_i \sigma_{iz}}{\sqrt{c_i}} \cdot \frac{\sum_{i=1}^{k} p_i \sigma_{iz} \sqrt{c_i}}{V_0} \tag{151} \]

where $V_0$ stands for the value of the variance with which it is
desired to estimate the mean. Comparing (151) with (17), we
notice that the optimum allocation is governed by the same
considerations as those mentioned in Section 3a.3 for simple
random sampling.

3b.3 Variance of the Estimate under (i) Optimum Allocation,
and (ii) Proportional Allocation when the Total Size of
Sample is Fixed



For n fixed, the optimum allocation is given by (150). Substi-
tuting for $n_i$ from (150) in (143), we get

\[ V(\bar{z}_w)_N = \frac{1}{n} \left(\sum_{i=1}^{k} p_i \sigma_{iz}\right)^2 \tag{152} \]

For proportional allocation we substitute $n_i = n p_i$ in (143) and
obtain

\[ V(\bar{z}_w)_P = \frac{1}{n} \sum_{i=1}^{k} p_i \sigma_{iz}^2 \tag{153} \]

Now (153) can be expressed as

\[ V(\bar{z}_w)_P = \frac{1}{n} \left(\sum_{i=1}^{k} p_i \sigma_{iz}\right)^2 + \frac{1}{n} \sum_{i=1}^{k} p_i (\sigma_{iz} - \bar{\sigma}_z)^2 \tag{154} \]

where

\[ \bar{\sigma}_z = \sum_{i=1}^{k} p_i \sigma_{iz} \tag{155} \]

It follows that the efficiency of optimum over proportional alloca-
tion is due wholly to the variation among the strata standard
deviations. If the $\sigma_{iz}$ are all equal, the two systems of allocation
become equally efficient.
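
The relation between (152), (153) and (154) may be illustrated by a short Python sketch. The stratum weights and the values of σ_iz below are hypothetical, and the function names are introduced only for this illustration.

```python
def optimum_allocation(n, p, sigma):
    """Equation (150): n_i proportional to p_i * sigma_iz for fixed total n."""
    total = sum(pi * si for pi, si in zip(p, sigma))
    return [n * pi * si / total for pi, si in zip(p, sigma)]


def variance(p, sigma, n_alloc):
    """Equation (143): sum of p_i^2 sigma_iz^2 / n_i."""
    return sum(pi ** 2 * si ** 2 / ni for pi, si, ni in zip(p, sigma, n_alloc))


# Hypothetical stratum weights and standard deviations of z.
p = [0.5, 0.3, 0.2]
sigma = [2.0, 5.0, 9.0]
n = 100

v_opt = variance(p, sigma, optimum_allocation(n, p, sigma))   # equation (152)
v_prop = variance(p, sigma, [n * pi for pi in p])             # equation (153)

sigma_bar = sum(pi * si for pi, si in zip(p, sigma))          # equation (155)
spread = sum(pi * (si - sigma_bar) ** 2 for pi, si in zip(p, sigma)) / n

print(v_opt, v_prop, v_prop - v_opt, spread)
```

The excess of the proportional over the optimum variance equals the second term of (154), and it vanishes when all the σ_iz are taken equal.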


3b.4 Comparison of Stratified with Unstratified Sampling 


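If a sample of n is selected as an unstratified sample with
replacement with selection probabilities $P_j$ (j = 1, 2, ..., N), then
we have seen that

\[ \bar{z}_n = \frac{1}{n} \sum_{j}^{n} z_j, \qquad z_j = \frac{y_j}{N P_j} \tag{156} \]

provides an unbiased estimate of the population mean $\bar{y}_N$ and that
its sampling variance is given by

\[ V(\bar{z}_n) = \frac{\sigma_z^2}{n} \tag{157} \]

where

\[ \sigma_z^2 = \sum_{j=1}^{N} P_j (z_j - \bar{z}_{..})^2 \tag{158} \]

and

\[ \bar{z}_{..} = \sum_{j=1}^{N} P_j z_j = \bar{y}_N \tag{159} \]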

If the sample of $n$ is selected as a stratified sample with $n_i$ units from the $i$-th stratum with selection probabilities proportional to $P_j$ given by

$$P_{ij} = \frac{P_j}{\sum^{N_i} P_j} \qquad (j = 1, 2, \ldots, N_i) \tag{160}$$

where $\sum^{N_i}$ denotes summation over the $N_i$ units in the $i$-th stratum, then

$$\bar{z}_s = \sum_{i=1}^{k} p_i \bar{z}_{n_i} \tag{161}$$

provides an unbiased estimate of $\bar{y}_N$, with its sampling variance given by

$$V(\bar{z}_s) = \sum_{i=1}^{k} \frac{p_i^2 \sigma_{iz}^2}{n_i} \tag{162}$$

where

$$\sigma_{iz}^2 = \sum_{j=1}^{N_i} P_{ij} (z_{ij} - \bar{z}_{i.})^2 \tag{163}$$


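For purposes of comparing (157) and (162), we note that

$$\sum^{N_i} P_j = P_{i.} \tag{164}$$

$$P_j = P_{i.} P_{ij} \tag{165}$$

$$z_j = \frac{p_i}{P_{i.}} z_{ij} \tag{166}$$

We may, therefore, write

$$\sigma_z^2 = \sum_{i=1}^{k} P_{i.} \sum_{j=1}^{N_i} P_{ij} \Big( \frac{p_i}{P_{i.}} z_{ij} - \bar{z}_{.} \Big)^2 = \sum_{i=1}^{k} \frac{p_i^2}{P_{i.}} \sigma_{iz}^2 + \sum_{i=1}^{k} P_{i.} \Big( \frac{p_i}{P_{i.}} \bar{z}_{i.} - \bar{z}_{.} \Big)^2 \tag{167}$$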
Hence, from (157) and (162), we get

$$V_{us} - V_s = \sum_{i=1}^{k} p_i^2 \sigma_{iz}^2 \Big( \frac{1}{n P_{i.}} - \frac{1}{n_i} \Big) + \frac{1}{n} \sum_{i=1}^{k} P_{i.} \Big( \frac{p_i}{P_{i.}} \bar{z}_{i.} - \bar{z}_{.} \Big)^2 \tag{168}$$


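Now the first term in (168) can be positive, zero or negative. It is zero when the sample is so allocated among the different strata that either

$$n_i \propto P_{i.}, \qquad n_i = n \sum^{N_i} P_j \tag{169}$$

or

$$n_i \propto \frac{p_i^2 \sigma_{iz}^2}{P_{i.}}, \qquad n_i = n \, \frac{p_i^2 \sigma_{iz}^2 / P_{i.}}{\sum_{i=1}^{k} p_i^2 \sigma_{iz}^2 / P_{i.}} \tag{170}$$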


The variance of an unstratified sample in this case is reduced by

$$V_{us} - V_s = \frac{1}{n} \sum_{i=1}^{k} P_{i.} \Big( \frac{p_i}{P_{i.}} \bar{z}_{i.} - \bar{z}_{.} \Big)^2 \tag{171}$$


The first term in (168) is positive when the sample is allocated according to the Neyman principle of allocation in (150). We have for this case

$$V_{us} - V_s = \frac{1}{n} \Big\{ \sum_{i=1}^{k} \frac{p_i^2 \sigma_{iz}^2}{P_{i.}} - \Big( \sum_{i=1}^{k} p_i \sigma_{iz} \Big)^2 \Big\} + \frac{1}{n} \sum_{i=1}^{k} P_{i.} \Big( \frac{p_i}{P_{i.}} \bar{z}_{i.} - \bar{z}_{.} \Big)^2 \tag{172}$$

The efficiency of a stratified sample will decrease as the allocation departs from the Neyman principle, and a point may be reached where the first term in (168) will not only be negative but larger in magnitude than the second term, thus making an unstratified sample more efficient than a stratified sample.


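For the special case when $P_{i.} = p_i$, (168) takes the form

$$(V_{us} - V_s)_{P_{i.} = p_i} = \sum_{i=1}^{k} p_i^2 \sigma_{iz}^2 \Big( \frac{1}{n p_i} - \frac{1}{n_i} \Big) + \frac{1}{n} \sum_{i=1}^{k} p_i (\bar{z}_{i.} - \bar{z}_{.})^2 \tag{173}$$

and in addition when the allocation is optimum, in accordance with the Neyman principle, the reduction in variance is given by

$$(V_{us} - V_s)_{P_{i.} = p_i} = \frac{1}{n} \sum_{i=1}^{k} p_i (\sigma_{iz} - \bar{\sigma}_z)^2 + \frac{1}{n} \sum_{i=1}^{k} p_i (\bar{z}_{i.} - \bar{z}_{.})^2 \tag{174}$$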

3b.5 Evaluation of the Change in Variance due to Stratification

In this section we shall evaluate, from a given stratified sample, the difference between the variance of an unstratified and a stratified sample. We write from (168)


к 
ey in)! 1 
Est. {Vus — Vs} = үз pe би (= = xj 
iei 


Now, from (137), 
VE) = EGY — tr = se 


whence 


k k 9 
piis DOG ME Ж pi Фа (176) 
Эң: 
іші 


4. 


ізі 
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Also, from (143), 
VG) =E (2,2) —22 = Die fax 


= 


whence 


Ей. 22 „зл — ра. бе (177) 


$ а бш (1 
DP: к Р. 1) (178) 


іші 
On substituting from (178) in (175), we, therefore, obtain 
k 


Est. (Vj; — V = г pe 6,2 ( 1 э 


ПР, п; 


(179 
When Р; = pi, we get ) 


k 
| ° ) б? 
Est. {Fus — ы. =н) == Pi бы pè > 
п ; | 1 


img i=1 
MI п 
hs ue. 3 1 ul 
z n |} Pi (Zn vl mi У^ (1—р) Шы 
.0% ы i=1 * 
(180) 


which is seen to be identical in form with (73) after making $N$ large in the latter.
CHAPTER IV


RATIO METHOD OF ESTIMATION 


A. SAMPLING WITH EQUAL PROBABILITY 
OF SELECTION 


4a.1 Introduction 


In developing the theory of simple random sampling in the 
preceding chapters, we have considered only estimates based on 
simple arithmetic means of the observed values in the sample. 
In this and the next chapter, we shall consider other methods of 
estimation which make use of the ancillary information and which, 
under certain conditions, give more reliable estimates of the 
population values than those based on the simple averages. Two 
of these methods are of particular importance. They are: (i) the 
ratio method of estimation, and (ii) the regression method of 
estimation. In this chapter we shall consider the former. 


4a.2 Notation and Definition of the Ratio Estimate 
Let

$y_i$ denote the value of the variate under study for the $i$-th unit of the population,

$x_i$ the value of the ancillary variate on the same unit,

$Y$ the total value of the $y$ variate in the population,

$X$ the total value of the $x$ variate in the population,

$r_i = y_i / x_i$ the ratio of $y$ to $x$ for the $i$-th unit,

$\bar{r}_N = \frac{1}{N} \sum_{i=1}^{N} r_i$ the simple arithmetic mean of the ratios for all units in the population,

$\bar{r}_n = \frac{1}{n} \sum_{i=1}^{n} r_i$ the simple arithmetic mean of the ratios for the units in the sample,

$R_N = \bar{y}_N / \bar{x}_N = Y / X$ the ratio of the population mean of $y$ to the population mean of $x$,

and

$R_n = \bar{y}_n / \bar{x}_n$ the corresponding ratio for the sample. (1)

$R_n$ is said to provide an estimate of the population ratio $R_N$, and the product of $R_n$ with $X$, given by

$$\hat{Y}_R = R_n X \tag{2}$$

provides an estimate of the total value $Y$ in the population. The estimate is known as the ratio estimate of the population total and its use presupposes the knowledge of $X$, the population total of $x$.


To take an example, $y$ may denote the number of bullocks on a holding and $x$ its area, the ratio $R_n$ giving an estimate of the number of bullocks per acre of a holding in the population. The product of $R_n$ with the total acreage of the holdings gives an estimate of the bullock population on the holdings. Or, again, $y$ and $x$ may denote the values of the character under study in two successive periods, e.g., the acreage under a crop during the current and the census years. It will be shown in a subsequent section that by taking advantage of the correlation between $y$ and $x$, the ratio method, under certain conditions, provides a more reliable estimate of the population value than the comparable estimate based on the simple arithmetic mean.

4a.3 Expected Value of the Ratio Estimate

At the outset it will be noticed that, unlike the estimate based on the simple arithmetic mean, in a ratio estimate the numerator and the denominator are both random variables. The derivation of the expected value of $R_n$, therefore, presents difficulties.


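Let

$$y_i = \bar{y}_N + \epsilon_i$$

so that

$$\bar{y}_n = \bar{y}_N + \bar{\epsilon}_n \tag{3}$$

where

$$E(\bar{\epsilon}_n) = 0 \quad \text{and} \quad E(\bar{\epsilon}_n^2) = \frac{N-n}{N} \cdot \frac{S_y^2}{n} \tag{4}$$

Similarly, let

$$x_i = \bar{x}_N + \epsilon_i'$$

so that

$$\bar{x}_n = \bar{x}_N + \bar{\epsilon}_n' \tag{5}$$

where

$$E(\bar{\epsilon}_n') = 0 \quad \text{and} \quad E(\bar{\epsilon}_n'^2) = \frac{N-n}{N} \cdot \frac{S_x^2}{n} \tag{6}$$

To obtain the expected value of $R_n$, it is convenient to express it in terms of $\bar{\epsilon}_n$ and $\bar{\epsilon}_n'$. We have, taking expectations,

$$E(R_n) = E\Big( \frac{\bar{y}_N + \bar{\epsilon}_n}{\bar{x}_N + \bar{\epsilon}_n'} \Big) = R_N \, E\Big\{ \Big( 1 + \frac{\bar{\epsilon}_n}{\bar{y}_N} \Big) \Big( 1 + \frac{\bar{\epsilon}_n'}{\bar{x}_N} \Big)^{-1} \Big\} \tag{7}$$

We shall now suppose that the $x$'s are positive and $n$ is sufficiently large, so that

$$\Big| \frac{\bar{\epsilon}_n'}{\bar{x}_N} \Big| < 1$$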
“ 
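Expanding* then $(1 + \bar{\epsilon}_n' / \bar{x}_N)^{-1}$ as a series in $\bar{\epsilon}_n' / \bar{x}_N$, we have

$$E(R_n) = R_N \, E\Big\{ 1 + \frac{\bar{\epsilon}_n}{\bar{y}_N} - \frac{\bar{\epsilon}_n'}{\bar{x}_N} + \frac{\bar{\epsilon}_n'^2}{\bar{x}_N^2} - \frac{\bar{\epsilon}_n \bar{\epsilon}_n'}{\bar{y}_N \bar{x}_N} + \frac{\bar{\epsilon}_n \bar{\epsilon}_n'^2}{\bar{y}_N \bar{x}_N^2} - \frac{\bar{\epsilon}_n'^3}{\bar{x}_N^3} + \frac{\bar{\epsilon}_n'^4}{\bar{x}_N^4} - \frac{\bar{\epsilon}_n \bar{\epsilon}_n'^3}{\bar{y}_N \bar{x}_N^3} + \ldots \Big\} \tag{8}$$

Further, we shall assume that the contribution of terms involving powers in $\bar{\epsilon}_n$ and $\bar{\epsilon}_n'$ higher than the second to the value of $E(R_n)$ is negligible, being of the order of $1/n^{\nu}$ where $\nu > 1$. Denoting to a first approximation the expected value of $R_n$ by $E_1(R_n)$, we may write

$$E_1(R_n) = R_N \Big\{ 1 + \frac{E(\bar{\epsilon}_n'^2)}{\bar{x}_N^2} - \frac{E(\bar{\epsilon}_n \bar{\epsilon}_n')}{\bar{y}_N \bar{x}_N} \Big\} \tag{9}$$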


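* In a paper read before the meeting of the International Statistical Institute, 1951, J. C. Koop justified the expansion by using an ingenious device. He wrote

$$\sum^{n} y = \sum^{N} y - \sum^{N-n} y$$

so that

$$n \bar{y}_n = N \bar{y}_N - (N - n) \bar{y}_{N-n}$$

Similarly

$$n \bar{x}_n = N \bar{x}_N - (N - n) \bar{x}_{N-n}$$

Hence

$$R_n = R_N \Big( 1 - \frac{N-n}{N} \cdot \frac{\bar{y}_{N-n}}{\bar{y}_N} \Big) \Big( 1 - \frac{N-n}{N} \cdot \frac{\bar{x}_{N-n}}{\bar{x}_N} \Big)^{-1}$$

Clearly, when the $x$'s are positive,

$$\frac{N-n}{N} \cdot \frac{\bar{x}_{N-n}}{\bar{x}_N} < 1$$

Expanding

$$\Big( 1 - \frac{N-n}{N} \cdot \frac{\bar{x}_{N-n}}{\bar{x}_N} \Big)^{-1}$$

by Taylor's theorem and working out expectation term by term, he reached the same expression for the expected value of $R_n$ as given in this and the next section.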
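Now

$$E(\bar{\epsilon}_n \bar{\epsilon}_n') = \frac{1}{n^2} E\Big[ \Big( \sum^{n} \epsilon_i \Big) \Big( \sum^{n} \epsilon_i' \Big) \Big] = \frac{1}{n^2} E\Big[ \sum_{i}^{n} \epsilon_i \epsilon_i' + \sum_{i \neq j}^{n} \epsilon_i \epsilon_j' \Big]$$

$$= \frac{1}{n^2} \Big[ \frac{n}{N} \sum_{i=1}^{N} \epsilon_i \epsilon_i' + \frac{n(n-1)}{N(N-1)} \sum_{i \neq j = 1}^{N} \epsilon_i \epsilon_j' \Big]$$

in virtue of (47) of Section 2a.9,

$$= \frac{1}{nN} \sum_{i=1}^{N} \epsilon_i \epsilon_i' + \frac{n-1}{nN(N-1)} \Big[ \Big( \sum_{i=1}^{N} \epsilon_i \Big) \Big( \sum_{i=1}^{N} \epsilon_i' \Big) - \sum_{i=1}^{N} \epsilon_i \epsilon_i' \Big]$$

$$= \frac{N-n}{Nn} \rho S_y S_x \tag{10}$$

where $\rho$ is the coefficient of correlation between $y$ and $x$, given by

$$\rho = \frac{E(y_i - \bar{y}_N)(x_i - \bar{x}_N)}{\sqrt{E(y_i - \bar{y}_N)^2 \, E(x_i - \bar{x}_N)^2}} \tag{11}$$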

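Substituting from (4), (6) and (10) in (9), we get

$$E_1(R_n) = R_N \Big\{ 1 + \frac{N-n}{Nn} \Big( \frac{S_x^2}{\bar{x}_N^2} - \frac{\rho S_y S_x}{\bar{y}_N \bar{x}_N} \Big) \Big\} = R_N \Big\{ 1 + \frac{N-n}{Nn} (C_x^2 - \rho C_y C_x) \Big\} \tag{12}$$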
We notice that the expected value of $R_n$ is not the population value $R_N$, showing that $R_n$ is a biased estimate of $R_N$. Denoting to a first approximation the relative bias in a ratio estimate by $B_1$, we have

$$B_1 = \frac{N-n}{Nn} (C_x^2 - \rho C_y C_x) \tag{13}$$

When $C_y = C_x = C$, the expression for $B_1$ simplifies and we get, for large $N$,

$$B_1 = \frac{C^2 (1 - \rho)}{n}$$

Thus for $C^2 = 0.8$, $\rho = 0.6$ and $n = 10$, the bias is seen to be a little over three per cent. of the population value of the ratio. The bias decreases as $n$ increases, showing that the ratio estimate is a consistent estimate.* For large $n$ and $\rho$ sufficiently high, the bias will usually be negligible. The bias vanishes altogether when

$$C_x^2 = \rho C_y C_x$$

or

$$\bar{y}_N = \rho \frac{S_y}{S_x} \bar{x}_N$$

In other words, the bias vanishes when the regression of $y$ on $x$ is a straight line through the origin. The result holds in general. For, let the regression of $y$ on $x$ be given by

$$E(y \mid x) = \beta x \tag{14}$$

Summing both sides of (14) over all units in the population, we obtain

$$\bar{y}_N = \beta \bar{x}_N$$

or, in other words,

$$\beta = R_N \tag{15}$$

It follows that

$$E(R_n) = E\Big\{ E\Big( \frac{\bar{y}_n}{\bar{x}_n} \,\Big|\, x_1, \ldots, x_n \Big) \Big\} = E\Big( \frac{\beta \bar{x}_n}{\bar{x}_n} \Big) = \beta = R_N$$

* The expression 'consistent estimate' is applied to those estimates from infinite populations which converge in probability to the population value.


An important point concerning the magnitude of the bias which 
will be clear later on is that, in large samples, the bias in a ratio 
estimate is negligible as compared with the standard error of the 
estimate. 


4a.4* Second Approximation to the Expected Value of the Ratio 
Estimate 


In deriving the expected value of $R_n$ in the previous section, we assumed that the contribution of terms involving powers in $\bar{\epsilon}_n$ and $\bar{\epsilon}_n'$ higher than the second is negligible. We shall now retain the terms in $\bar{\epsilon}_n$ and $\bar{\epsilon}_n'$ up to and including degree four, and proceed to obtain a better approximation to the expected value of $R_n$.

Taking expectations term by term in (8), and using $E_2(R_n)$ to denote the second approximation to the expected value of $R_n$, we write


Е(г„?) __Е(елги) , Е(г„г„?) 
R, R [| ASS пеп £ n s. 
Fa (R,) У + Hy? P3091 Pool" 


E (e°) | Е (é,'4) 6) Е) 


Hor? | ші" оро? tS) 


where шо = Jy and іо = Xy. The evaluation of the terms on 
the right-hand side involves heavy algebra which is best dealt with by the method of bi-partitional functions. The relevant formulae have been tabulated by the author (1944) and reproduced in the Appendix to this chapter. Using then (6), (10) and the formulae in the Appendix, and writing in terms of the moment notation given by


1 N 


іші 
we һауе 


ы ї{ =й 1 Bye P 
E: (R) = Ry |і + NHI я (5 po 


(М —n)(N — 2л) 1 Ма _ іш) 


(М-ІИ(М-2) m Vua? — ta? 


T NIU P 


Ко _ а.) 
Hoi [o 


N(N—n)(N-—n-— D 3(n—1) 
*t(W-0)N-2(-3) m 


«(mo me) 


Ho Hao? 


It is seen that the contribution of higher order terms depends, 
apart from n, on the values of the moments and product-moments 
of the two variates. To get an idea of its magnitude, we shall 
suppose that N is large and that, further, the population follows 
a bivariate normal distribution, so that 


а = 0 = Ha; Hos = 0 = Изо) Hos Зроз?; Has = Зил 
We then have 


l f Bos Ил 3 ко 
Е, (R) = Е, i+: g + ТЕ; 
з (Ry) ч Поп ваг Paolo п? uy? 
g ( Hon, к M )] 
Fou Ило!от 


$$= R_N \Big[ 1 + \frac{1}{n} (C_x^2 - \rho C_y C_x) + \frac{3}{n^2} C_x^2 (C_x^2 - \rho C_y C_x) \Big] \tag{18}$$
To a second approximation, the relative bias in a ratio estimate in samples of $n$ from a large population can, therefore, be expressed as

$$B_2 = B_1 \Big( 1 + \frac{3}{n} C_x^2 \Big) \tag{19}$$

Equation (19) shows that the contribution of the third and fourth degree terms to the relative bias of a ratio estimate is $3C_x^2/n$ times the value of the latter to a first approximation. Unless $n$ is small, the contribution can, therefore, be considered to be negligible. G. R. Ayachit (1953) has assessed the value of contributions to the bias from successive approximations by means of experimental sampling on a wide range of populations commonly met with in surveys, and found that the contribution of higher order terms is negligible. For appreciably large $n$, say 30 or larger, even the leading term is found to be of no consequence.
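The two approximations to the relative bias are easily evaluated. The following is a minimal sketch, assuming the forms $B_1 = \frac{N-n}{Nn}(C_x^2 - \rho C_y C_x)$ and $B_2 = B_1 (1 + 3C_x^2/n)$ for a large, bivariate normal population; the function names are hypothetical, and the figures are those of the illustration in Section 4a.3.

```python
def relative_bias_first(n, c_x, c_y, rho, N=None):
    """First approximation to the relative bias of a ratio estimate.

    N=None stands for a large population, i.e. the factor (N - n)/N is taken as 1.
    """
    fpc = 1.0 if N is None else (N - n) / N
    return fpc * (c_x ** 2 - rho * c_y * c_x) / n


def relative_bias_second(n, c_x, c_y, rho):
    """Second approximation for a large bivariate normal population."""
    return relative_bias_first(n, c_x, c_y, rho) * (1.0 + 3.0 * c_x ** 2 / n)


c = 0.8 ** 0.5                              # C^2 = 0.8, with C_y = C_x = C
b1 = relative_bias_first(10, c, c, 0.6)     # a little over three per cent
b2 = relative_bias_second(10, c, c, 0.6)    # about four per cent
```

For $n = 10$ the higher order terms add roughly a quarter to the leading term; for $n = 100$ with the same coefficients the addition is under three per cent of it.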


4a.5 Variance of the Ratio Estimate 
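By definition,

$$V(R_n) = E\{ R_n - E(R_n) \}^2 \tag{20}$$

From (8) we write to a first approximation

$$R_n = R_N + R_N \Big( \frac{\bar{\epsilon}_n}{\bar{y}_N} - \frac{\bar{\epsilon}_n'}{\bar{x}_N} \Big) + R_N \Big( \frac{\bar{\epsilon}_n'^2}{\bar{x}_N^2} - \frac{\bar{\epsilon}_n \bar{\epsilon}_n'}{\bar{x}_N \bar{y}_N} \Big) \tag{21}$$

Hence, substituting from (21) and (12) in (20), we have on expanding and retaining terms up to the second degree,

$$E\{ R_n - E(R_n) \}^2 = R_N^2 \, E\Big( \frac{\bar{\epsilon}_n}{\bar{y}_N} - \frac{\bar{\epsilon}_n'}{\bar{x}_N} \Big)^2 = R_N^2 \, E\Big( \frac{\bar{\epsilon}_n^2}{\bar{y}_N^2} + \frac{\bar{\epsilon}_n'^2}{\bar{x}_N^2} - \frac{2 \bar{\epsilon}_n \bar{\epsilon}_n'}{\bar{y}_N \bar{x}_N} \Big) \tag{22}$$

Let $V_1$ denote the variance of a ratio estimate to a first approximation. Taking the expectations term by term, we obtain

$$V_1(R_n) = R_N^2 \, \frac{N-n}{N} \cdot \frac{1}{n} (C_y^2 + C_x^2 - 2\rho C_y C_x) \tag{23}$$

or

$$V_1\Big( \frac{R_n}{R_N} \Big) = \frac{N-n}{N} \cdot \frac{1}{n} (C_y^2 + C_x^2 - 2\rho C_y C_x) \tag{24}$$

When $C_y = C_x = C$, the expression for the relative variance takes the form

$$V_1\Big( \frac{R_n}{R_N} \Big) = \frac{N-n}{N} \cdot \frac{2C^2 (1 - \rho)}{n} \tag{25}$$

and, for large $N$,

$$V_1\Big( \frac{R_n}{R_N} \Big) = \frac{2C^2 (1 - \rho)}{n} = 2 \, (\text{relative bias}) \tag{26}$$

Comparing (26) with (13), we see that the bias in a ratio estimate is of the order $1/n$ and hence negligible, for $n$ sufficiently large, as compared to its standard error which is of the order $1/\sqrt{n}$.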
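To obtain the variance of the estimate of the total, namely, $\hat{Y}_R$, we multiply (23) by $N^2 \bar{x}_N^2$, so that

$$V_1(\hat{Y}_R) = \frac{N(N-n)}{n} \, \bar{y}_N^2 \, (C_y^2 + C_x^2 - 2\rho C_y C_x) \tag{27}$$

which can also be alternatively written as

$$V_1(\hat{Y}_R) = \frac{N(N-n)}{n} (S_y^2 + R_N^2 S_x^2 - 2 R_N \rho S_y S_x) = N^2 \{ V(\bar{y}_n) + R_N^2 V(\bar{x}_n) - 2 R_N \, \mathrm{Cov}(\bar{y}_n, \bar{x}_n) \} \tag{28}$$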


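An alternative expression for the variance of the ratio estimate which, in some ways, is more instructive can be obtained by expanding the expression within brackets in (27). We have

$$C_y^2 + C_x^2 - 2\rho C_y C_x = \frac{S_y^2}{\bar{y}_N^2} + \frac{S_x^2}{\bar{x}_N^2} - \frac{2\rho S_y S_x}{\bar{y}_N \bar{x}_N}$$

$$= \frac{1}{\bar{y}_N^2} \cdot \frac{1}{N-1} \Big\{ \sum_{i=1}^{N} (y_i - \bar{y}_N)^2 + R_N^2 \sum_{i=1}^{N} (x_i - \bar{x}_N)^2 - 2 R_N \sum_{i=1}^{N} (y_i - \bar{y}_N)(x_i - \bar{x}_N) \Big\}$$

$$= \frac{1}{\bar{y}_N^2} \cdot \frac{1}{N-1} \sum_{i=1}^{N} (y_i - R_N x_i)^2 \tag{29}$$

Hence, from (23) and (27),

$$V_1(R_n) = \frac{N-n}{Nn} \cdot \frac{1}{\bar{x}_N^2} \cdot \frac{1}{N-1} \sum_{i=1}^{N} (y_i - R_N x_i)^2 \tag{30}$$

and

$$V_1(\hat{Y}_R) = \frac{N(N-n)}{n} \cdot \frac{1}{N-1} \sum_{i=1}^{N} (y_i - R_N x_i)^2 \tag{31}$$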


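If the population is regarded as divided into $k$ classes with the $N_i$ units in the $i$-th class having the value $x_i$ each, (30) and (31) can be rewritten to read as follows:

$$V_1(R_n) = \frac{N-n}{Nn} \cdot \frac{1}{\bar{x}_N^2} \cdot \frac{1}{N-1} \sum_{i=1}^{k} \sum_{j=1}^{N_i} (y_{ij} - R_N x_i)^2 \tag{32}$$

and

$$V_1(\hat{Y}_R) = \frac{N(N-n)}{n} \cdot \frac{1}{N-1} \sum_{i=1}^{k} \sum_{j=1}^{N_i} (y_{ij} - R_N x_i)^2 \tag{33}$$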
Clearly, the term under the first summation is proportional to the variance of $y$ for a fixed value of $x$ when the regression of $y$ on $x$ is linear, and the regression line passes through the origin. For, in this case, we have from (14) and (15),

$$E(y_{ij} \mid x_i) = R_N x_i$$

Hence

$$\sum_{j=1}^{N_i} (y_{ij} - R_N x_i)^2 = \sum_{j=1}^{N_i} \{ y_{ij} - E(y_{ij} \mid x_i) \}^2 = N_i V(y_{ij} \mid x_i)$$

or simply

$$= N_i V(y \mid x_i)$$

We may, therefore, write (32) and (33) as follows:

$$V_1(R_n) = \frac{N-n}{Nn} \cdot \frac{1}{\bar{x}_N^2} \cdot \frac{1}{N-1} \sum_{i=1}^{k} N_i V(y \mid x_i) \tag{34}$$

and

$$V_1(\hat{Y}_R) = \frac{N(N-n)}{n} \cdot \frac{1}{N-1} \sum_{i=1}^{k} N_i V(y \mid x_i) \tag{35}$$


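It is, therefore, seen that the variance of a ratio estimate depends upon the relationship between the variance of $y$ for a fixed $x$ and the value of $x$. The situations of practical importance are those in which

(a) $V(y \mid x) = \text{constant}$, say $\gamma$, or $V\big( \frac{y}{x} \mid x \big) = \frac{\gamma}{x^2}$ (36a)

(b) $V(y \mid x) \propto x$, say $\gamma x$, or $V\big( \frac{y}{x} \mid x \big) = \frac{\gamma}{x}$ (36b)

and

(c) $V(y \mid x) \propto x^2$, say $\gamma x^2$, or $V\big( \frac{y}{x} \mid x \big) = \gamma$ (36c)

On substitution in (34) and (35) we obtain the approximate values given below for the variance of a ratio estimate appropriate to the above three cases.

(a) $V_1(R_n) = \frac{N-n}{N-1} \cdot \frac{\gamma}{n \bar{x}_N^2}$, $\quad V_1(\hat{Y}_R) = \frac{N^2 (N-n)}{N-1} \cdot \frac{\gamma}{n}$ (37a)

(b) $V_1(R_n) = \frac{N-n}{N-1} \cdot \frac{\gamma}{n \bar{x}_N}$, $\quad V_1(\hat{Y}_R) = \frac{N^2 (N-n)}{N-1} \cdot \frac{\gamma \bar{x}_N}{n}$ (37b)

(c) $V_1(R_n) = \frac{N-n}{N-1} \cdot \frac{\gamma}{n} \cdot \frac{\sum_{i=1}^{k} N_i x_i^2}{N \bar{x}_N^2}$, $\quad V_1(\hat{Y}_R) = \frac{N(N-n)}{N-1} \cdot \frac{\gamma}{n} \sum_{i=1}^{k} N_i x_i^2$ (37c)

We will later on show that when $V(y \mid x) \propto x$, the ratio estimate is the best unbiased linear estimate for a given set of $x$'s.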
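The equivalence of the two expressions for the first approximation to the variance, one in terms of $S_y^2$, $S_x^2$ and the covariance and the other in terms of deviations from the ratio line, can be verified on any finite population. The sketch below assumes the forms $\frac{N(N-n)}{n}(S_y^2 + R_N^2 S_x^2 - 2R_N \rho S_y S_x)$ and $\frac{N(N-n)}{n} \cdot \frac{1}{N-1} \sum (y_i - R_N x_i)^2$; the function names and the small population are hypothetical.

```python
def variance_total_from_moments(y, x, n):
    """First approximation to V(Y_R) from the population mean squares and covariance."""
    N = len(y)
    y_bar, x_bar = sum(y) / N, sum(x) / N
    R = y_bar / x_bar
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / (N - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (N - 1)
    s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (N - 1)
    return N * (N - n) / n * (s_yy + R * R * s_xx - 2.0 * R * s_yx)


def variance_total_from_residuals(y, x, n):
    """The same quantity from squared deviations about the line y = R_N x."""
    N = len(y)
    R = sum(y) / sum(x)
    mean_square = sum((yi - R * xi) ** 2 for yi, xi in zip(y, x)) / (N - 1)
    return N * (N - n) / n * mean_square


# Hypothetical population of five units (illustrative values only).
x_pop = [10.0, 20.0, 30.0, 40.0, 50.0]
y_pop = [3.0, 7.0, 8.0, 13.0, 14.0]
v_moments = variance_total_from_moments(y_pop, x_pop, n=2)
v_residuals = variance_total_from_residuals(y_pop, x_pop, n=2)
```

The residual form makes it plain that the variance is small whenever the points lie close to a line through the origin.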


4a.6 Estimate of the Variance of the Ratio Estimate 
Just as $s_y^2$, $\bar{y}_n$, $s_x^2$ and $\bar{x}_n$ provide unbiased estimates of the corresponding population values, similarly $s_{yx}$ defined by

$$s_{yx} = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_n)(x_i - \bar{x}_n)}{n - 1}$$

provides an unbiased estimate of the corresponding population value $\rho S_y S_x$. If the sample means and variances are independently distributed as they will, for instance, be in samples of $n$ from a normal bivariate population, an estimate of the relative variance of a ratio estimate will be given by

$$\text{Est. } V_1\Big( \frac{R_n}{R_N} \Big) = \frac{N-n}{Nn} \Big( \frac{s_y^2}{\bar{y}_n^2} + \frac{s_x^2}{\bar{x}_n^2} - \frac{2 s_{yx}}{\bar{y}_n \bar{x}_n} \Big) \tag{38}$$

On simplification in the manner shown in the previous section, (38) reduces to

$$\text{Est. } V_1\Big( \frac{R_n}{R_N} \Big) = \frac{N-n}{Nn} \cdot \frac{1}{\bar{y}_n^2} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - R_n x_i)^2 \tag{39}$$

We can thus take the estimates of the variances of $R_n$ and $\hat{Y}_R$ to be


Est. V; y e X Na ШЫ; h ^x L h Ол — Rx)? (40) 
and

$$\text{Est. } V_1(\hat{Y}_R) = \frac{N(N-n)}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - R_n x_i)^2 \tag{41}$$


The reader will note that these are biased estimates but the bias 
will be negligible if the coefficients of variation of y and x are small. 
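The computation from a sample may be sketched as follows. The code assumes a simple random sample drawn without replacement from $N$ units, a known population total $X$, and the estimate of variance $\frac{N(N-n)}{n} \cdot \frac{1}{n-1} \sum (y_i - R_n x_i)^2$; the function name and the figures are hypothetical.

```python
def ratio_estimate_of_total(y, x, x_total, N):
    """Return (ratio estimate of the population total, estimate of its variance).

    y, x    : sample values of the variate under study and of the ancillary variate
    x_total : known population total X of the ancillary variate
    N       : number of units in the population
    """
    n = len(y)
    r_n = sum(y) / sum(x)                       # sample ratio R_n
    estimate = r_n * x_total                    # ratio estimate of the total
    mean_square = sum((yi - r_n * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    variance = N * (N - n) / n * mean_square    # biased, but nearly unbiased for small C_y, C_x
    return estimate, variance


# Hypothetical sample of three units from a population of 50 with X = 100.
total, var_total = ratio_estimate_of_total([3.0, 5.0, 8.0], [1.0, 2.0, 3.0], 100.0, 50)
```

When every sample point lies exactly on the line through the origin the estimated variance is zero, which reflects the dependence of the variance on deviations from the ratio line.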

One special case of a ratio estimate, for which the estimated variance takes a particularly simple form, may be mentioned. It is the case of a weighted mean in which the weights are in the nature of ancillary information, varying from one sampling unit to another, with $y_i$ of the form $w_i m_i$ and $x_i = w_i$, so that

$$R_n = \frac{\sum_{i=1}^{n} w_i m_i}{\sum_{i=1}^{n} w_i} \tag{42}$$


The sample estimate of the variance for this case is given by 


Est. руб) = oa uut а-а (9) 


Ма Wy? Ame 


and 


Bi, quj e EH ЕТ ) еш 89 


п 


4a.7* Second Approximation to the Variance of the Ratio Estimate 


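We shall first obtain an expression for the relative mean square error of $R_n$ defined as

$$\frac{E(R_n - R_N)^2}{R_N^2} \tag{45}$$

From (8), we have to a second approximation, i.e., neglecting the contribution of terms involving powers in $\bar{\epsilon}_n$ and $\bar{\epsilon}_n'$ higher than the fourth,

$$\frac{R_n - R_N}{R_N} = \frac{\bar{\epsilon}_n}{\bar{y}_N} - \frac{\bar{\epsilon}_n'}{\bar{x}_N} + \frac{\bar{\epsilon}_n'^2}{\bar{x}_N^2} - \frac{\bar{\epsilon}_n \bar{\epsilon}_n'}{\bar{y}_N \bar{x}_N} + \frac{\bar{\epsilon}_n \bar{\epsilon}_n'^2}{\bar{y}_N \bar{x}_N^2} - \frac{\bar{\epsilon}_n'^3}{\bar{x}_N^3} + \frac{\bar{\epsilon}_n'^4}{\bar{x}_N^4} - \frac{\bar{\epsilon}_n \bar{\epsilon}_n'^3}{\bar{y}_N \bar{x}_N^3} \tag{46}$$

On squaring both sides, expanding the right-hand side and retaining terms up to and including the fourth power in $\bar{\epsilon}_n$ and $\bar{\epsilon}_n'$ and taking expectations, we obtain

$$E\Big( \frac{R_n - R_N}{R_N} \Big)^2 = E\Big( \frac{\bar{\epsilon}_n^2}{\bar{y}_N^2} + \frac{\bar{\epsilon}_n'^2}{\bar{x}_N^2} - \frac{2 \bar{\epsilon}_n \bar{\epsilon}_n'}{\bar{y}_N \bar{x}_N} \Big) + 2E\Big( \frac{2 \bar{\epsilon}_n \bar{\epsilon}_n'^2}{\bar{y}_N \bar{x}_N^2} - \frac{\bar{\epsilon}_n^2 \bar{\epsilon}_n'}{\bar{y}_N^2 \bar{x}_N} - \frac{\bar{\epsilon}_n'^3}{\bar{x}_N^3} \Big) + E\Big( \frac{3 \bar{\epsilon}_n^2 \bar{\epsilon}_n'^2}{\bar{y}_N^2 \bar{x}_N^2} - \frac{6 \bar{\epsilon}_n \bar{\epsilon}_n'^3}{\bar{y}_N \bar{x}_N^3} + \frac{3 \bar{\epsilon}_n'^4}{\bar{x}_N^4} \Big) \tag{47}$$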


We notice that the second approximation to the mean square 
error involves the addition of the last two terms in (47) to the 


expression as given in (22). Using the results from the Appen- 
dix, we obtain 


= (5А zany Em (A pmo d 
N—1 n Мао ka Мо 


4 2(N —n (N —2n) 1 


(N—D(N—2) т" 


x Jus _ ра __ Hos ) 
ТЕ Hao Poi Ho? 
3(N — п) (№ + № — 6nN + бл?) 


n® (N — 1) (N — 2) (М —3) 


Ha __ 215 Ра x 
Hao Hg? Hola? ы 
3(n — 1) М(М —n)(N —n —1) 
n®(N —1)(N—2)(N—3) | 


СЕЛ + pn? бироз Jug? 
. Pao Har? Pia? Ној“ 


If the population is large and follows the bivariate normal 
distribution, we have 


(48) 


Изу = Hai = Шз = Hos = 0 


Pos = 38,", шо = 38,4, изу = 35,56, pis = 375, 5,3 
раз = (1 + 2p?) 5,8,2 


{= 


| 3o 
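so that

$$E\Big( \frac{R_n - R_N}{R_N} \Big)^2 = \frac{1}{n} (C_y^2 - 2\rho C_y C_x + C_x^2) + \frac{3}{n^2} \{ (1 + 2\rho^2) C_y^2 C_x^2 - 6\rho C_y C_x^3 + 3 C_x^4 \}$$

which can be expressed as

$$E\Big( \frac{R_n - R_N}{R_N} \Big)^2 = V_1\Big( \frac{R_n}{R_N} \Big) \Big( 1 + \frac{3 C_x^2}{n} \Big) + \frac{6 C_x^2}{n^2} (C_x - \rho C_y)^2 \tag{49}$$

where $V_1(R_n / R_N)$ denotes the relative variance of $R_n$ to a first approximation in samples of $n$ from a large population.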


For the finite population, the effect will be approximately to multiply $V_1(R_n / R_N)$ by $(N - n)/N$. We may, therefore, write

$$E\Big( \frac{R_n - R_N}{R_N} \Big)^2 = V_1\Big( \frac{R_n}{R_N} \Big) \Big\{ 1 + \frac{3 C_x^2}{n} \Big\} + \frac{6 C_x^2}{n^2} (C_x - \rho C_y)^2 \tag{50}$$

Now

$$E(R_n - R_N)^2 = E[R_n - E(R_n)]^2 + [E(R_n) - R_N]^2 = V(R_n) + (\text{bias})^2 \tag{51}$$

Hence, deducting from (50) the square of the relative bias term given by (13), we obtain

$$V_2\Big( \frac{R_n}{R_N} \Big) = V_1\Big( \frac{R_n}{R_N} \Big) \Big\{ 1 + \frac{3 C_x^2}{n} \Big\} + \frac{5 C_x^2}{n^2} (C_x - \rho C_y)^2 \tag{52}$$
When $C_y = C_x = C$, the expression simplifies to

$$V_2\Big( \frac{R_n}{R_N} \Big) = V_1\Big( \frac{R_n}{R_N} \Big) \Big( 1 + \frac{3 C^2}{n} \Big) + \frac{5 C^4 (1 - \rho)^2}{n^2} \tag{53}$$

Since $(1 - \rho)^2$ will usually be negligible as compared to $(1 - \rho)$, we have

$$V_2\Big( \frac{R_n}{R_N} \Big) = V_1\Big( \frac{R_n}{R_N} \Big) \Big( 1 + \frac{3 C^2}{n} \Big) \tag{54}$$

It will be seen that the expression for the second approximation to the relative variance is related to the first approximation in the same way as the expression for the second approximation to the relative bias is related to the first. We conclude that, unless $n$ is too small, the first approximation may be considered as adequate. The result is due to Cochran (1940).


4a.8 Conditions for a Ratio Estimate to be the Best Unbiased 
Linear Estimate 


We shall now show that when (i) the mean value of $y$ for a given $x$ is linear in $x$ and the line passes through the origin, and (ii) the variance of $y$ about this line is proportional to $x$, then, for a given set of $x$'s, the ratio estimate $\hat{Y}_R$ gives the best unbiased linear estimate of the population total and its variance is given by

$$\frac{N-n}{N} \cdot \frac{\gamma N^2 \bar{x}_N^2}{n \bar{x}_n}$$

Let equation (14), namely,

$$E(y \mid x) = \beta x \tag{55}$$

denote the regression of $y$ on $x$ passing through the origin, and

$$V(y \mid x) = \gamma x \tag{56}$$

denote the relationship between the variance of $y$ for a given $x$, and $x$. We have seen in (15) that $\beta$ in this case represents the population ratio $R_N$.


It is known from Section 2a.3 that the best unbiased linear estimate is given by a corollary of the Markoff theorem on linear estimation (Neyman and David, 1938). The method consists in setting up a linear function of observations as the estimate of $\beta$ and minimizing its variance subject to the condition that the estimate is an unbiased estimate of $\beta$. Suppose that the estimate of the population total $Y$ is given by

$$\hat{Y}_B = \sum_{i=1}^{k} \lambda_i \sum_{j=1}^{n_i} y_{ij} = \sum_{i=1}^{k} \lambda_i n_i \bar{y}_{n_i} \tag{57}$$

where the $y$ observations in the same class are assumed to have equal weight and the $\lambda_i$'s are chosen by the application of the Markoff method. Now, the condition of unbiasedness gives

$$E\Big( \sum_{i=1}^{k} \lambda_i n_i \bar{y}_{n_i} \Big) = Y = \sum_{i=1}^{k} N_i \bar{y}_{N_i} \tag{58}$$

Substituting from (55), we obtain

$$\sum_{i=1}^{k} \lambda_i n_i \beta x_i = \sum_{i=1}^{k} N_i \beta x_i$$

or

$$\sum_{i=1}^{k} (n_i \lambda_i - N_i) x_i = 0 \tag{59}$$

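Now the variance of $\hat{Y}_B$ for given $n_1, n_2, \ldots, n_k$ is given by

$$V(\hat{Y}_B \mid n_1, n_2, \ldots, n_k) = E\Big\{ \sum_{i=1}^{k} \lambda_i n_i (\bar{y}_{n_i} - \bar{y}_{N_i}) \Big\}^2$$

$$= E\Big\{ \sum_{i=1}^{k} \lambda_i^2 n_i^2 (\bar{y}_{n_i} - \bar{y}_{N_i})^2 + \sum_{i \neq i'} \lambda_i \lambda_{i'} n_i n_{i'} (\bar{y}_{n_i} - \bar{y}_{N_i})(\bar{y}_{n_{i'}} - \bar{y}_{N_{i'}}) \Big\}$$

$$= \sum_{i=1}^{k} \lambda_i^2 n_i^2 V(\bar{y}_{n_i} \mid x_i) \tag{60}$$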


From (56), we have

$$V(y \mid x = x_i) = \gamma x_i$$

i.e.,

$$\frac{N_i - 1}{N_i} S_i^2 = \gamma x_i$$

where $S_i^2$ is the mean square for the $i$-th class. Hence

$$V(\bar{y}_{n_i} \mid x_i) = \frac{N_i - n_i}{N_i} \cdot \frac{S_i^2}{n_i} = \frac{N_i - n_i}{N_i - 1} \cdot \frac{\gamma x_i}{n_i} \tag{61}$$

Substituting in (60), we get

$$V(\hat{Y}_B \mid n_1, n_2, \ldots, n_k) = \gamma \sum_{i=1}^{k} \lambda_i^2 n_i \frac{N_i - n_i}{N_i - 1} x_i \tag{62}$$

The Markoff method of estimation requires that the $\lambda_i$'s are to be so determined that (62) is minimum subject to the condition that (59) holds.


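Let

$$\phi = \gamma \sum_{i=1}^{k} \lambda_i^2 n_i \frac{N_i - n_i}{N_i - 1} x_i - 2\mu \sum_{i=1}^{k} (n_i \lambda_i - N_i) x_i \tag{63}$$

where $\mu$ is some constant. Equation (63) can be written as

$$\phi = \gamma \sum_{i=1}^{k} n_i x_i \frac{N_i - n_i}{N_i - 1} \Big\{ \lambda_i - \frac{\mu}{\gamma} \cdot \frac{N_i - 1}{N_i - n_i} \Big\}^2 + \text{terms independent of } \lambda_i \tag{64}$$

Clearly, $V(\hat{Y}_B \mid n_1, n_2, \ldots, n_k)$ is minimum when each of the square terms on the right-hand side of (64) is zero, or in other words, when

$$\lambda_i = \frac{\mu}{\gamma} \cdot \frac{N_i - 1}{N_i - n_i} \tag{65}$$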
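To evaluate $\mu$, we substitute for $\lambda_i$ from (65) in (59) and obtain

$$\frac{\mu}{\gamma} = \frac{\sum_{i=1}^{k} N_i x_i}{\sum_{i=1}^{k} n_i x_i \dfrac{N_i - 1}{N_i - n_i}}$$

whence

$$\lambda_i = \frac{N_i - 1}{N_i - n_i} \cdot \frac{\sum_{i=1}^{k} N_i x_i}{\sum_{i=1}^{k} n_i x_i \dfrac{N_i - 1}{N_i - n_i}} \qquad (i = 1, 2, \ldots, k) \tag{66}$$

Hence, substituting for $\lambda_i$ from (66) in (57) and (62), we obtain

$$\hat{Y}_B = \Big( \sum_{i=1}^{k} N_i x_i \Big) \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} \dfrac{N_i - 1}{N_i - n_i}}{\sum_{i=1}^{k} n_i x_i \dfrac{N_i - 1}{N_i - n_i}} \tag{67}$$

and

$$V(\hat{Y}_B \mid n_1, n_2, \ldots, n_k) = \frac{\gamma \Big( \sum_{i=1}^{k} N_i x_i \Big)^2}{\sum_{i=1}^{k} n_i x_i \dfrac{N_i - 1}{N_i - n_i}} \tag{68}$$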

If the sampling in each class is assumed to be carried out with replacement, or alternatively, $n_i$ is small compared with $N_i$ so that $(N_i - 1)/(N_i - n_i)$ can be assumed to be unity, (67) and (68) are seen to reduce to

$$\hat{Y}_B = N \bar{x}_N \frac{\bar{y}_n}{\bar{x}_n} = N \bar{x}_N R_n \tag{69}$$

and

$$V(\hat{Y}_B \mid n_1, n_2, \ldots, n_k) = \frac{\gamma N^2 \bar{x}_N^2}{n \bar{x}_n} \tag{70}$$

showing that the ratio estimate under the conditions stated in the beginning of this section gives the best unbiased linear estimate, provided the $N_i$'s are large.
When, however, the $N_i$'s are not large, the estimate $\hat{Y}_B$ will be only approximately given by (69) and the effect on the variance of $\hat{Y}_B$ will be to multiply (70) by the usual finite multiplier $(N - n)/N$. For, estimating $N_i$ from the sample, we have

$$\hat{N}_i = N \frac{n_i}{n}$$

and hence

$$\frac{N_i - 1}{N_i - n_i} \simeq \frac{N}{N - n} \tag{71}$$

On substituting from (71) in (67) and (68), we obtain

$$\hat{Y}_B = N \bar{x}_N R_n \tag{72}$$

and

$$V(\hat{Y}_B \mid n_1, n_2, \ldots, n_k) = \frac{N - n}{N} \cdot \frac{\gamma N^2 \bar{x}_N^2}{n \bar{x}_n} \tag{73}$$

We notice that the variance depends upon the set of $x$'s which happen to turn up in the sample. In repeated samples, to a first approximation, the average value of (73) is given by

$$V(\hat{Y}_B) = \frac{N(N - n)}{n} \gamma \bar{x}_N \tag{74}$$

The slight difference between this expression and (37b) is due to the use of approximations in the derivation of both.


4a.9 Confidence Limits 


We have just seen that when (i) the relationship between $y$ and $x$ is a straight line passing through the origin, and (ii) the variance of $y$ about this line is proportional to $x$, the ratio estimate $R_n$ is the best unbiased linear estimate of the population ratio for a given set of $x$'s, with the sampling variance given by

$$\frac{N - n}{N} \cdot \frac{\gamma}{n \bar{x}_n}$$

It is a well-known property of an unbiased linear estimate that if $n$ is not too small and $N$ is large, the probability that the difference between the estimate and the population value will exceed a fixed multiple of the standard error of the estimate is approximately equal to the probability as determined by the normal law. Consequently, the confidence limits for a population ratio are obtained in the manner indicated in Chapter II, being given by

$$R_n \pm t_{(\alpha, \infty)} \sqrt{\frac{N - n}{N} \cdot \frac{\gamma}{n \bar{x}_n}} \tag{75}$$


When, however, conditions (i) and (ii) of Section 4a.8 do not hold, the exact distribution of a ratio estimate is not known to have been expressed in a simple form. For large samples, however, the distribution of $R_n / R_N$ can be regarded as normal for all practical purposes, with the standard error given by

$$\frac{1}{\sqrt{n}} (C_y^2 - 2\rho C_y C_x + C_x^2)^{1/2}$$

We, therefore, expect the following inequality to hold on the average with probability $(1 - \alpha)$:

$$-t_{(\alpha, \infty)} \frac{1}{\sqrt{n}} (C_y^2 - 2\rho C_y C_x + C_x^2)^{1/2} < \frac{R_n}{R_N} - 1 < t_{(\alpha, \infty)} \frac{1}{\sqrt{n}} (C_y^2 - 2\rho C_y C_x + C_x^2)^{1/2}$$

yielding the following confidence limits for $R_N$:

$$\frac{R_n}{1 \pm t_{(\alpha, \infty)} \dfrac{1}{\sqrt{n}} (C_y^2 - 2\rho C_y C_x + C_x^2)^{1/2}} \tag{76}$$

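For small samples, the following method is available. Let $(x_i, y_i)$ be normally distributed. Consider a function

$$u = \bar{y}_n - R_N \bar{x}_n \tag{77}$$

Clearly, $u$ will be normally distributed with variance

$$\sigma_u^2 = \frac{1}{n} (S_y^2 - 2 R_N \rho S_y S_x + R_N^2 S_x^2) \tag{78}$$

The confidence limits for $R_N$ with the confidence coefficient $(1 - \alpha)$ are then determined by the two roots of the quadratic in $R_N$ given by

$$t_{(\alpha, \infty)} = \frac{\sqrt{n} \, (\bar{y}_n - R_N \bar{x}_n)}{(S_y^2 - 2 R_N \rho S_y S_x + R_N^2 S_x^2)^{1/2}} \tag{79}$$

Solving the above quadratic, we obtain for the confidence limits of $R_N$ the following values:

$$R_n \, \frac{\Big( 1 - \dfrac{t_{(\alpha, \infty)}^2}{n} \rho C_y C_x \Big) \pm t_{(\alpha, \infty)} \Big\{ \dfrac{1}{n} (C_y^2 - 2\rho C_y C_x + C_x^2) - \dfrac{t_{(\alpha, \infty)}^2}{n^2} C_y^2 C_x^2 (1 - \rho^2) \Big\}^{1/2}}{1 - \dfrac{t_{(\alpha, \infty)}^2}{n} C_x^2} \tag{80}$$

where

$$C_y = \frac{S_y}{\bar{y}_n} \quad \text{and} \quad C_x = \frac{S_x}{\bar{x}_n} \tag{81}$$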
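Both sets of limits can be computed directly. The sketch below assumes the large-sample form $R_n / (1 \pm t \cdot \text{s.e.})$ and, for the second method, solves the quadratic $n(\bar{y}_n - R\bar{x}_n)^2 = t^2 (S_y^2 - 2R\rho S_y S_x + R^2 S_x^2)$ in $R$; the function names, the default normal deviate 1.96 and the figures are hypothetical.

```python
from math import sqrt


def large_sample_limits(r_n, n, c_y, c_x, rho, t=1.96):
    """Limits R_n / (1 +/- t * s.e.), valid only while t * s.e. is below 1."""
    se = sqrt((c_y ** 2 - 2.0 * rho * c_y * c_x + c_x ** 2) / n)
    if t * se >= 1.0:
        raise ValueError("sample too small for the large-sample limits")
    return r_n / (1.0 + t * se), r_n / (1.0 - t * se)


def quadratic_limits(y_bar, x_bar, s_y, s_x, rho, n, t=1.96):
    """Roots in R of n (y_bar - R x_bar)^2 = t^2 (s_y^2 - 2 R rho s_y s_x + R^2 s_x^2)."""
    a = n * x_bar ** 2 - t ** 2 * s_x ** 2
    b = -2.0 * (n * x_bar * y_bar - t ** 2 * rho * s_y * s_x)
    c = n * y_bar ** 2 - t ** 2 * s_y ** 2
    disc = b * b - 4.0 * a * c
    if a <= 0.0 or disc < 0.0:
        raise ValueError("no finite interval: x_bar is not well determined")
    root = sqrt(disc)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


# Hypothetical figures (illustrative only).
lower, upper = quadratic_limits(y_bar=30.0, x_bar=100.0, s_y=12.0, s_x=35.0, rho=0.8, n=25)
approx_lower, approx_upper = large_sample_limits(0.3, 25, 12.0 / 30.0, 35.0 / 100.0, 0.8)
```

The guard on the leading coefficient reflects the fact that the quadratic yields a finite interval only when $\bar{x}_n$ differs clearly from zero relative to its standard error.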


4a.10 Efficiency of the Ratio Estimate 


We have seen that the variance of the estimate of the population total based on the simple arithmetic mean is given by

$$\frac{N(N - n)}{n} S_y^2$$

We also saw that the first approximation to the variance of the estimate of the population total based on the ratio method is given by

$$\frac{N(N - n)}{n} \{ S_y^2 + R_N^2 S_x^2 - 2 R_N \rho S_y S_x \}$$

Now the relative efficiency of an estimate B compared to that of another estimate A based on a sample of equal size is defined in Section 3a.10 as the ratio of the inverse of their variances. Hence

$$\text{Efficiency} = \frac{S_y^2}{S_y^2 + R_N^2 S_x^2 - 2 R_N \rho S_y S_x} = \frac{1}{1 + \Big( \dfrac{C_x}{C_y} \Big)^2 - 2\rho \Big( \dfrac{C_x}{C_y} \Big)} \tag{82}$$

It follows that, in large samples, the ratio estimate will be more efficient than the corresponding sample estimate based on the simple arithmetic mean if the denominator is less than 1, i.e., if

$$\Big( \frac{C_x}{C_y} \Big)^2 - 2\rho \Big( \frac{C_x}{C_y} \Big) < 0$$

or

$$\rho > \frac{1}{2} \cdot \frac{C_x}{C_y}$$

If $C_y = C_x$, as will be the case, for example, when $y$ and $x$ denote values in two consecutive periods of the same variate, $\rho$ will have to be larger than one-half in order that the ratio estimate may be more efficient than the one based on the simple arithmetic mean.
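The condition is readily explored numerically. The following is a minimal sketch, assuming the large-sample expression $1 / \{1 + (C_x/C_y)^2 - 2\rho (C_x/C_y)\}$ for the efficiency; the function names are hypothetical.

```python
def ratio_efficiency(c_x, c_y, rho):
    """Efficiency of the ratio estimate relative to the simple arithmetic mean."""
    k = c_x / c_y
    return 1.0 / (1.0 + k * k - 2.0 * rho * k)


def ratio_method_pays(c_x, c_y, rho):
    """True when rho exceeds half the ratio of the coefficients of variation."""
    return rho > 0.5 * c_x / c_y


# With equal coefficients of variation the two methods break even at rho = 0.5.
break_even = ratio_efficiency(0.4, 0.4, 0.5)
high_correlation = ratio_efficiency(0.4, 0.4, 0.75)
```

With equal coefficients of variation a correlation of 0.75 doubles the efficiency, while a correlation below one-half makes the ratio estimate the less efficient of the two.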


Example 4.1 


A sample survey for the estimation of livestock numbers was carried out in Etawah (India) during the spring of 1951. Table 4.1 summarizes the data in respect of the number of livestock $y$ and the agricultural area $x$ in all the 364 villages in the Etawah sub-division. The range of agricultural area is divided into 7 classes: 0-100, 101-200, 201-300, 301-400, 401-600, 601-1000 and greater than 1000 acres; and for each of these classes, the number of villages ($N_i$), the mean agricultural area per village ($\bar{x}_{N_i}$), the mean number of livestock per village ($\bar{y}_{N_i}$) and the values of $V(x \mid i)$, $V(y \mid i)$ and $\rho_i$ are given ($i = 1, 2, \ldots, 7$). The values of the means, the variances and the coefficient of correlation obtained by combining together the first six classes, i.e., grouping together all the villages having agricultural area up to 1000 acres, as well as those obtained by combining together all the seven classes, are also given in cols. 8 and 9 of Table 4.1. Examine, graphically or otherwise, whether conditions (i) and (ii) in Section 4a.8 can be considered to be satisfied, so that advantage may be taken of the ratio method of estimation for estimating the livestock population.
n 
We have worked out in Table 4.1 the ratios of $\bar{y}_{N_i}$ to $\bar{x}_{N_i}$ and of $V(y \mid i)$ to $\bar{x}_{N_i}$ for the different classes. The ratio of $\bar{y}_{N_i}$ to $\bar{x}_{N_i}$ will be seen to be fairly constant, showing that the relationship between $y$ and $x$ is approximately linear. $V(y \mid i)$ also appears to vary as $x$ up to 1000 acres but not beyond it. It has, however, to be observed that the coefficient of correlation for the last class, namely, with villages having area larger than 1000 acres, is rather large and the calculated value $V(y \mid i)$ cannot possibly give for this class a correct idea of the variance of $y$ about the line $y = R_N x$. On the other hand, any further division of this class
to study the behaviour of the variance of y with x is also not 
feasible owing to the fact that the number of observations is few. 
As about 35% of the livestock population is accounted for by 
villages with agricultural area larger than 1000 acres, it appears 
advisable to study separately areas less than 1000 acres and those 
with larger acreage. The ratio estimate may be used to provide 
an efficient estimate of the livestock population for the first group 
comprising all the villages in the first six classes. 


Example 4.2 


Calculate for a sample of 64 villages, the sampling variance of 
the estimate of the total livestock population for villages with area 
less than 1000 acres based on (a) the simple arithmetic mean, 
and (b) the ratio method. Hence calculate the relative efficiency 
of the latter as compared with the former. 


For obtaining the variance of the estimate based on the simple arithmetic mean, we need the value of $V(y)$ based on all the 319 observations ($N$) with area below 1000 acres. This is given in col. 8 of Table 4.1.

Substituting in the formula for the variance of the estimated total based on the simple arithmetic mean, we obtain

$$V(N \bar{y}_n) = N^2 \cdot \frac{N - n}{N - 1} \cdot \frac{V(y)}{n} = 319^2 \times \frac{319 - 64}{318} \times \frac{8292}{64} = 10572000$$

Since the sample size is fairly large, the sampling variance of the ratio estimate of the total livestock population may be assumed to be given by

$$\frac{N^2 (N - n)}{(N - 1) n} \Big[ V(y) - 2 R_N \rho \sqrt{V(x) V(y)} + R_N^2 V(x) \Big]$$

where $V(x)$, $V(y)$ and $\rho \sqrt{V(x) V(y)}$ are the variance of $x$, the variance of $y$ and the covariance of $x$ and $y$, and

$$R_N = \frac{\sum_{i=1}^{6} N_i \bar{y}_{N_i}}{\sum_{i=1}^{6} N_i \bar{x}_{N_i}}$$


Now, from Table 4.1, we get

$$R_N = \frac{113.4}{367.5} = 0.3086$$

and, therefore,

$$R_N^2 = 0.09523$$

Also

$$V(x) = 39528, \qquad V(y) = 8292$$

and

$$\rho \sqrt{V(x) V(y)} = 12378$$

Substituting these values in the expression for the variance of the
ratio estimate of the total livestock population given above, we
obtain


$$\begin{aligned}
V(Y_R) &= \frac{319 \times 255}{64} \times \frac{319}{318} \times [8292 - 2 \times 0.3086 \times 12378 + 0.09523 \times 39528] \\
&= 1271.0 \times \frac{319}{318} \, [8292 - 2 \times 3820 + 3764] \\
&= 1271.0 \times 4430 \\
&= 5631000
\end{aligned}$$

Alternatively, we can consider

$$\frac{1}{N-1} \sum_{i=1}^{6} N_i V(y \mid i)$$

as an estimate of the mean square deviation from the ratio line
since the correlation coefficient between the agricultural area in
a village and the livestock population in each of the six classes
is very low and not significant. In this case, we get the variance
of the total livestock population estimated by the ratio method as

$$N(N-n) \times \frac{1}{n} \times \frac{1}{N-1} \sum_{i=1}^{6} N_i V(y \mid i) = 319 \times 255 \times \frac{1}{64} \times 4602 = 5849000$$


This value is only slightly larger than the one calculated above, 
as one would expect owing to the small values of p within the 
classes and the close linear relationship between y and x. 

It will be seen that the variance of the simple arithmetic mean
estimate of the total livestock population exceeds the variance of
the ratio estimate by 88%, showing thereby that the latter is 88%
more efficient than the former. This large gain in efficiency of
the ratio estimate is to be expected in view of the high correlation
between the agricultural area in a village and the number of
livestock in it.
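The arithmetic of Example 4.2 can be retraced with a short Python sketch. It uses only the figures quoted in the example (N = 319, n = 64, V(y) = 8292, V(x) = 39528, covariance 12378, R_N = 0.3086); the function and variable names are illustrative assumptions, not part of the text.

```python
# Figures quoted in Example 4.2 (villages with area below 1000 acres).
N, n = 319, 64        # population size and sample size
V_y = 8292.0          # variance of y (livestock), divisor N
V_x = 39528.0         # variance of x (agricultural area), divisor N
cov_xy = 12378.0      # covariance of x and y, divisor N
R = 0.3086            # population ratio R_N


def var_total_simple_mean(N, n, V_y):
    """Variance of the estimated total based on the simple arithmetic mean."""
    return N**2 * (N - n) / (N - 1) * V_y / n


def var_total_ratio(N, n, V_y, V_x, cov_xy, R):
    """Large-sample variance of the ratio estimate of the total."""
    residual = V_y - 2 * R * cov_xy + R**2 * V_x
    return N**2 * (N - n) / (n * (N - 1)) * residual


v_mean = var_total_simple_mean(N, n, V_y)
v_ratio = var_total_ratio(N, n, V_y, V_x, cov_xy, R)
gain_percent = 100 * (v_mean / v_ratio - 1)
print(round(v_mean), round(v_ratio), round(gain_percent, 1))
```

The printed values agree with the rounded figures 10572000 and 5631000 of the example to within rounding, and the gain comes out close to 88%.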
4a.11 Ratio Estimate in Stratified Sampling


Let $R_{n_t}$ denote the estimate of the population ratio for the
t-th stratum and $Y_{R_t}$ denote the ratio estimate of the population
total of y in the t-th stratum. Then, clearly, the estimate of the
population total of y over all the strata is given by

$$Y_{Rs} = \sum_{t=1}^{k} Y_{R_t} = \sum_{t=1}^{k} \frac{\bar{y}_{n_t}}{\bar{x}_{n_t}} N_t \bar{x}_{N_t} = \sum_{t=1}^{k} R_{n_t} N_t \bar{x}_{N_t} \qquad (84)$$

Now

$$E(Y_{Rs}) = \sum_{t=1}^{k} N_t \bar{x}_{N_t} E(R_{n_t}) \qquad (85)$$

Hence, from (12), we have

$$E(Y_{Rs}) = Y + \sum_{t=1}^{k} Y_t \, \frac{N_t - n_t}{N_t n_t} \left( \frac{S_{tx}^2}{\bar{x}_{N_t}^2} - \frac{\rho_t S_{tx} S_{ty}}{\bar{x}_{N_t} \bar{y}_{N_t}} \right) \qquad (86)$$


It follows that $Y_{Rs}$ is a biased but consistent estimate. To obtain
an idea of how the bias diminishes with the size of the sample,
we will suppose that the finite multiplier approximates to unity,
$n_t = n/k$ and further assume that $S_{tx}/\bar{x}_{N_t}$, $S_{ty}/\bar{y}_{N_t}$ and $\rho_t$ are each
of the same order from stratum to stratum, say $C_x$, $C_y$ and $\rho$
respectively. The relative bias in the estimate then equals

$$\frac{k}{n} (C_x^2 - \rho C_x C_y)$$

It follows that in order that $Y_{Rs}$ should provide a satisfactory
estimate of the population total, the sample size within each
stratum should be sufficiently large.


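The variance of $Y_{Rs}$, to a first approximation, is given by

$$\begin{aligned}
V_1(Y_{Rs}) &= E(Y_{Rs} - Y)^2 \\
&= E\Big\{ \sum_{t=1}^{k} N_t \bar{x}_{N_t} (R_{n_t} - R_{N_t}) \Big\}^2 \\
&= E\Big\{ \sum_{t=1}^{k} N_t^2 \bar{x}_{N_t}^2 (R_{n_t} - R_{N_t})^2 + \sum_{t \neq t'} N_t N_{t'} \bar{x}_{N_t} \bar{x}_{N_{t'}} (R_{n_t} - R_{N_t})(R_{n_{t'}} - R_{N_{t'}}) \Big\} \qquad (87)
\end{aligned}$$

Since to a first approximation

$$E(R_{n_t}) = R_{N_t},$$

and sampling is done independently in the different strata, the
product term is zero and we obtain

$$\begin{aligned}
V_1(Y_{Rs}) &= \sum_{t=1}^{k} N_t^2 \bar{x}_{N_t}^2 V_1(R_{n_t}) \qquad (88) \\
&= \sum_{t=1}^{k} N_t^2 \, \frac{N_t - n_t}{N_t n_t} \, (S_{ty}^2 + R_{N_t}^2 S_{tx}^2 - 2 R_{N_t} \rho_t S_{tx} S_{ty}) \\
&= N \sum_{t=1}^{k} p_t \, \frac{N_t - n_t}{n_t} \, (S_{ty}^2 + R_{N_t}^2 S_{tx}^2 - 2 R_{N_t} \rho_t S_{tx} S_{ty}) \qquad (89)
\end{aligned}$$

where $p_t = N_t/N$. Using (29), this can also be written as

$$V_1(Y_{Rs}) = N \sum_{t=1}^{k} p_t \, \frac{N_t - n_t}{n_t} \cdot \frac{1}{N_t - 1} \sum_{i=1}^{N_t} (y_{ti} - R_{N_t} x_{ti})^2 \qquad (90)$$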

The above formulae are based on the assumption that $n_t$ is large.
This, however, is not always true in practice. To get over this
difficulty, Hansen, Hurwitz and Gurney (1946) suggest a single
combined ratio, namely,

$$\frac{\sum_{t=1}^{k} p_t \bar{y}_{n_t}}{\sum_{t=1}^{k} p_t \bar{x}_{n_t}} \qquad (91)$$

and denote this ratio by $R_{nc}$, and the estimate of the population
total by $Y_{Rc}$, in order to distinguish them from the corresponding
estimates based on separate strata. To obtain the expected value
of (91), we write


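$$\sum_{t=1}^{k} p_t \bar{y}_{n_t} = \bar{y}_N + \bar{\varepsilon}_n \quad \text{and} \quad \sum_{t=1}^{k} p_t \bar{x}_{n_t} = \bar{x}_N + \bar{\varepsilon}_n' \qquad (92)$$

where

$$E(\bar{\varepsilon}_n) = 0, \qquad E(\bar{\varepsilon}_n') = 0 \qquad (93)$$

$$E(\bar{\varepsilon}_n'^2) = \sum_{t=1}^{k} \frac{N_t - n_t}{N_t n_t} \, p_t^2 S_{tx}^2, \qquad E(\bar{\varepsilon}_n \bar{\varepsilon}_n') = \sum_{t=1}^{k} \frac{N_t - n_t}{N_t n_t} \, p_t^2 \rho_t S_{tx} S_{ty} \qquad (94)$$

Then to a first approximation

$$E(R_{nc}) = R_N \left\{ 1 + \frac{E(\bar{\varepsilon}_n'^2)}{\bar{x}_N^2} - \frac{E(\bar{\varepsilon}_n \bar{\varepsilon}_n')}{\bar{x}_N \bar{y}_N} \right\} \qquad (95)$$

Hence, the relative bias in $R_{nc}$ is given by

$$\sum_{t=1}^{k} \frac{N_t - n_t}{N_t n_t} \, p_t^2 \left( \frac{S_{tx}^2}{\bar{x}_N^2} - \frac{\rho_t S_{tx} S_{ty}}{\bar{x}_N \bar{y}_N} \right) \qquad (96)$$

To have an idea as to how rapidly the bias diminishes with the
size of the sample, we will suppose that $n_t$ is proportional to $N_t$
and $S_{tx}$, $S_{ty}$ and $\rho_t$ are constant. The relative bias will then be
seen to be given by

$$\frac{1}{n} (C_x^2 - \rho C_x C_y) \qquad (97)$$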


It follows that even when the size of the sample within each 
stratum is small, a combined ratio estimate can give a satisfactory 
estimate of the population total provided the total sample is
sufficiently large. 
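The contrast between the two relative biases, (k/n)(C_x^2 - rho C_x C_y) for separate ratios with equal allocation and (1/n)(C_x^2 - rho C_x C_y) for the combined ratio with proportional allocation, can be illustrated numerically. The following Python sketch ignores the finite multiplier, as the text does; the function names and the numerical values of n, k, C_x, C_y and rho are hypothetical and chosen only for illustration.

```python
def relative_bias_separate(n, k, c_x, c_y, rho):
    """Approximate relative bias with separate ratios, n_t = n/k in each stratum."""
    return (k / n) * (c_x**2 - rho * c_x * c_y)


def relative_bias_combined(n, c_x, c_y, rho):
    """Approximate relative bias of the combined ratio, n_t proportional to N_t."""
    return (1 / n) * (c_x**2 - rho * c_x * c_y)


# Hypothetical illustration: 100 units spread over 10 strata.
n, k = 100, 10
c_x, c_y, rho = 0.5, 0.6, 0.8
bias_sep = relative_bias_separate(n, k, c_x, c_y, rho)
bias_comb = relative_bias_combined(n, c_x, c_y, rho)
print(bias_sep, bias_comb)
```

Under these assumptions the separate-ratio bias is k times the combined-ratio bias, which is the point made above about small samples within strata.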


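To work out the sampling variance of the combined ratio, we
have, to a first approximation

$$\begin{aligned}
V_1(Y_{Rc}) &= Y^2 \left\{ \frac{E(\bar{\varepsilon}_n^2)}{\bar{y}_N^2} + \frac{E(\bar{\varepsilon}_n'^2)}{\bar{x}_N^2} - \frac{2 E(\bar{\varepsilon}_n \bar{\varepsilon}_n')}{\bar{x}_N \bar{y}_N} \right\} \\
&= N \sum_{t=1}^{k} p_t \, \frac{N_t - n_t}{n_t} \, \{ S_{ty}^2 + R_N^2 S_{tx}^2 - 2 R_N \rho_t S_{tx} S_{ty} \} \qquad (98)
\end{aligned}$$

It is interesting to note that the sampling variance of the combined
ratio has the same form as that of the ratio estimate based on
separate strata except that there is now a single ratio $R_N$ in place
of $R_{N_t}$.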


The difference between the sampling variance of $Y_{Rc}$ in (98)
and of $Y_{Rs}$ in (89) is given by

$$\begin{aligned}
& N \sum_{t=1}^{k} p_t \, \frac{N_t - n_t}{n_t} \, [ S_{tx}^2 (R_N^2 - R_{N_t}^2) - 2 (R_N - R_{N_t}) \rho_t S_{tx} S_{ty} ] \\
&= N \sum_{t=1}^{k} p_t \, \frac{N_t - n_t}{n_t} \, [ S_{tx}^2 (R_N - R_{N_t})^2 + 2 (R_N - R_{N_t}) (R_{N_t} S_{tx}^2 - \rho_t S_{tx} S_{ty}) ] \qquad (99)
\end{aligned}$$

It will be seen that (99) depends upon the magnitude of the
variation between the strata ratios and the value of

$$(R_{N_t} S_{tx}^2 - \rho_t S_{tx} S_{ty})$$

The latter will, however, be usually small, vanishing in fact when
the regression of y on x is a straight line through the origin
within each stratum. It follows, therefore, that the combined
estimate will have a lower precision than that based on separate 
strata. On the other hand, the bias in the former estimate will 
be smaller than in the latter. Unless, therefore, the population 
ratios in the different strata vary considerably, the use of a 
combined ratio may provide an estimate which has a negligible 
bias and whose precision is almost as high as that of the estimate 
based on separate ratios. 

Lastly, we shall determine the optimum allocation of the sample 
among the different strata when a ratio estimate is used. We 
shall consider the simplest case for which the cost of the survey 
is proportional to the size of the sample. Assuming that the 
cost of the survey is fixed at, say $C_0$, and that $C_0 = cn$, where
c is the cost per unit in the sample, the optimum allocation is
given by minimizing the variance of the ratio estimate given by
(90) for fixed n, say $n_0$.

Let

$$\phi = \sum_{t=1}^{k} \left( \frac{N_t^2}{n_t} - N_t \right) S_{ty}'^2 + \mu \sum_{t=1}^{k} n_t \qquad (100)$$

where $\mu$ is a constant, and

$$S_{ty}'^2 = \frac{\sum_{i=1}^{N_t} (y_{ti} - R_{N_t} x_{ti})^2}{N_t - 1} \qquad (101)$$

Clearly, $\phi$ can be written as

$$\phi = \sum_{t=1}^{k} \left( \frac{N_t S_{ty}'}{\sqrt{n_t}} - \sqrt{\mu n_t} \right)^2 + \text{terms independent of } n_t \qquad (102)$$

It follows that the optimum value of $n_t$ is given by

$$n_t \propto N_t S_{ty}' \qquad (103)$$

This result is thus analogous to that for the simple arithmetic
mean estimate, except that instead of the variance of y within
a stratum, viz., $S_{ty}^2$, we now have the residual variance of y about
the ratio line in the stratum, viz., $S_{ty}'^2$.

Example 4.3


From the livestock data referred to earlier, it is proposed to 
draw a stratified random sample of 73 villages (amounting to 20% 
of the total) and to estimate the total livestock population for 
the entire subdivision. The villages having agricultural area up to 
1000 acres constitute the first stratum and the remaining villages 
the second stratum. Calculate the optimum allocation of the 
villages between the two strata if the method of estimation to be 
adopted is (i) ratio method with a common ratio for both strata, 
(ii) ratio method with separate ratios for the two strata, and 
(iii) simple estimation within each stratum. 


Also calculate the sampling variance of the estimated total by 
each of the above methods and hence compare their efficiencies. 


The relevant calculations for each method step by step are
presented in Tables 4.2, 4.3 and 4.4. The tables are self-
explanatory. The results are tabulated below:


                                          Number of Villages
                                            in the Sample        Sampling
Method of Estimation                     Stratum 1  Stratum 2    Variance    Efficiency

(i)   Ratio estimate with common ratio
      for both strata                        54         19        8707000      191.5

(ii)  Ratio estimate with separate
      ratios for the two strata              54         19        8688000      192.0

(iii) Simple estimate within each
      stratum                                53         20       16677000      100.0
TABLE 4.2


Stratified Random Sampling with a Single Combined Ratio
for Both Strata


(Adopting Optimum Allocation Between the Strata)


R_N = 151.9/520.4 = 0.2919          R_N^2 = 0.08521

                                                          Stratum 1       Stratum 2
                                                          Agricultural    Agricultural
                                                          Area            Area
                                                          0-1000 Acres    > 1000 Acres
(1)  N_i                                                       319             45
(2)  V(y | i)                                                 8292          58402
(3)  rho_i sqrt{V(x | i) V(y | i)}                           12378         107673
(4)  R_N . (3)                                                3613          31430
(5)  V(x | i)                                                39528         377255
(6)  R_N^2 . (5)                                              3368          32146
(7)  Residual M.S.
     S'_iy^2 = N_i/(N_i - 1) . [(2) - 2.(4) + (6)]            4448          28317
(8)  S'_iy = sqrt{(7)}                                        66.7          168.3
(9)  N_i S'_iy = (1).(8)                                     21300           7600
(10) n_i *                                                      54             19
(11) N_i (N_i - n_i)                                         84535           1170
(12) V(Y_Ri) = (7).(11)/(10)                               6963000        1744000

     V(Y_Rc) = 6963000 + 1744000
             = 8707000
     S.E.    = 2951
     Y       = (151.9) (364)
             = 55300
     % S.E.  = (2951/55300) x 100 = 5.3


* Obtained by distributing 73, the total number of villages to be sampled, in
proportion to N_i S'_iy given in row (9).
TABLE 4.3

Stratified Random Sampling with Separate Ratios
for the Two Strata


(Adopting Optimum Allocation Between the Strata)


                                                          Stratum 1       Stratum 2
                                                          Agricultural    Agricultural
                                                          Area            Area
                                                          0-1000 Acres    > 1000 Acres
(1)  N_i                                                       319             45
(2)  R_Ni = ybar_Ni / xbar_Ni                               0.3086         0.2650
(3)  R_Ni^2                                                0.09523        0.07023
(4)  V(y | i)                                                 8292          58402
(5)  rho_i sqrt{V(x | i) V(y | i)}                           12378         107673
(6)  (2).(5)                                                  3820          28533
(7)  V(x | i)                                                39528         377255
(8)  (3).(7)                                                  3764          26495
(9)  Residual M.S.
     S'_iy^2 = N_i/(N_i - 1) . [(4) - 2.(6) + (8)]            4430*         28464
(10) S'_iy = sqrt{(9)}                                        66.6          168.7
(11) N_i S'_iy = (1).(10)                                    21200           7600
(12) n_i †                                                      54             19
(13) N_i (N_i - n_i)                                         84535           1170
(14) V(Y_Ri) = (9).(13)/(12)                               6935000        1753000

     V(Y_Rs) = 6935000 + 1753000
             = 8688000
     S.E.    = 2948
     Y       = (151.9) (364)
             = 55300
     % S.E.  = (2948/55300) x 100 = 5.3


* The steps leading to this figure are reproduced from Example 4.2.

† Obtained by distributing the total sample in proportion to N_i S'_iy shown in
row (11).
TABLE 4.4


Stratified Random Sampling with Simple Estimate
for Each Stratum


(Adopting Optimum Allocation Between the Strata)


                                                          Stratum 1       Stratum 2
                                                          Agricultural    Agricultural
                                                          Area            Area
                                                          0-1000 Acres    > 1000 Acres
(1)  N_i                                                       319             45
(2)  (N_i - 1)/N_i . S_iy^2                                   8292          58402
(3)  S_iy^2                                                   8318          59729
(4)  S_iy                                                     91.2          244.4
(5)  N_i S_iy                                                29100          11000
(6)  n_i *                                                      53             20
(7)  N_i (N_i - n_i)                                         84854           1125
(8)  V(Y_i) = (7).(3)/(6)                                 13317000        3360000

     V(Y)   = 16677000
     Y      = 55300
     S.E.   = 4084
     % S.E. = (4084/55300) x 100 = 7.4


* Obtained by distributing the total sample in proportion to N_i S_iy shown in
row (5).

The variance of the estimate (ii) is less than that of the 
estimate (i) as we should expect but only slightly so. That there 
is no appreciable gain in assuming separate ratio lines for the 
two strata is also borne out by the fact that the optimum distri- 
bution of the villages in both cases turns out to be the same. 
Compared to the simple mean estimate, however, the ratio method 
is found to be considerably more efficient. 
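The allocations and variances of Example 4.3 can be retraced from the mean squares shown in Tables 4.2 to 4.4. The Python sketch below allocates the 73 villages in proportion to N_i times the square root of the relevant mean square and then sums N_i (N_i - n_i) S^2 / n_i over the two strata. The function names are illustrative assumptions; simple rounding is used for the allocation, which happens to add up to 73 here but need not do so in general.

```python
import math


def allocate(n_total, sizes, mean_squares):
    """Allocate n_total in proportion to N_i * sqrt(mean square), with simple rounding."""
    weights = [N_i * math.sqrt(s2) for N_i, s2 in zip(sizes, mean_squares)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]


def variance_of_total(sizes, alloc, mean_squares):
    """Sum over strata of N_i (N_i - n_i) S^2 / n_i."""
    return sum(N_i * (N_i - n_i) * s2 / n_i
               for N_i, n_i, s2 in zip(sizes, alloc, mean_squares))


sizes = [319, 45]
mean_squares = {
    "combined ratio": [4448, 28317],    # row (7) of Table 4.2
    "separate ratios": [4430, 28464],   # row (9) of Table 4.3
    "simple estimate": [8318, 59729],   # row (3) of Table 4.4
}

results = {}
for method, ms in mean_squares.items():
    alloc = allocate(73, sizes, ms)
    results[method] = (alloc, variance_of_total(sizes, alloc, ms))

base = results["simple estimate"][1]
for method, (alloc, var) in results.items():
    print(method, alloc, round(var), round(100 * base / var, 1))
```

The printed allocations, variances and efficiencies agree with the summary table of the example to within rounding.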


4a.12 Ratio Method for Qualitative Characters: Two Classes 


We shall now consider one important application of the
preceding theory to the case of qualitative characters. Suppose
the population is divided into two mutually exclusive classes with
$N_1$ and $N_2$ observations respectively, so that

$$N_1 = Np, \qquad N_2 = Nq$$

and

$$N_1 + N_2 = N$$

Assume that a simple random sample of n is chosen from N,
and that $n_1$ of the observations in the sample are in class 1 and
$n_2$ in class 2. We shall consider the problem of evaluating the
expected value and the variance of the ratio $R_n$ defined by

$$R_n = \frac{n_1}{n_2} \qquad (104)$$


Let $y_i$ (i = 1, 2, ..., N) be assumed to have the value 1 when-
ever it falls in class 1, and 0 if it falls in class 2; and let $x_i$
(i = 1, 2, ..., N) be assumed to take the value 0 whenever it falls
in class 1, and 1 if it falls in class 2. It is then easy to see that

$$R_n = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{\bar{y}_n}{\bar{x}_n} \qquad (105)$$

and

$$R_N = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{\bar{y}_N}{\bar{x}_N} = \frac{p}{q} \qquad (106)$$


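It follows that the mean value and the variance of $R_n$ can be
obtained from the formulae derived in the preceding sections by
substituting for $\bar{x}_N$, $\bar{y}_N$, $S_x^2$, $S_y^2$ and $\rho$ in terms of N, p and q.
Now it is easy to see that

$$\bar{y}_N = p \qquad (107)$$

$$S_y^2 = \frac{\sum_{i=1}^{N} y_i^2 - N \bar{y}_N^2}{N - 1} = \frac{Np - Np^2}{N - 1} = \frac{N}{N-1} \, pq \qquad (108)$$

Similarly,

$$\bar{x}_N = q \qquad (109)$$

and

$$S_x^2 = \frac{N}{N-1} \, pq \qquad (110)$$

Lastly,

$$\rho S_x S_y = \frac{\sum_{i=1}^{N} x_i y_i - N \bar{x}_N \bar{y}_N}{N - 1} = \frac{0 - Npq}{N - 1} = -\frac{N}{N-1} \, pq \qquad (111)$$

so that $\rho = -1$. On substituting in (12), we have

$$E(R_n) = \frac{p}{q} \left\{ 1 + \frac{N-n}{Nn} \left( \frac{N}{N-1} \frac{pq}{q^2} + \frac{N}{N-1} \frac{pq}{pq} \right) \right\} = \frac{p}{q} \left\{ 1 + \frac{N-n}{N-1} \cdot \frac{1}{nq} \right\} \qquad (112)$$

It follows that when N is large, so that the finite multiplier can be
taken to be unity, the relative bias in the ratio is given by 1/nq.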


The table below gives the values of n for different values of
q in order that the relative bias in a ratio estimate may not
exceed 2%.

It is seen that n has ordinarily to be very large and particularly
so when q is small, in order that the bias may be negligible. 
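The sample size needed to keep the relative bias 1/nq within a stated limit follows directly from that expression. The Python sketch below assumes N large, so that the finite multiplier is taken as unity; the function name and the values of q are illustrative choices, not figures taken from the text. Exact fractions are used so that the rounding up is not affected by floating-point error.

```python
import math
from fractions import Fraction


def min_sample_size(q, max_relative_bias):
    """Smallest n with 1/(n q) <= max_relative_bias (finite multiplier taken as unity)."""
    return math.ceil(1 / (Fraction(q) * Fraction(max_relative_bias)))


limit = Fraction(2, 100)   # relative bias not to exceed 2%
sizes_needed = {q: min_sample_size(q, limit) for q in ("0.5", "0.2", "0.1", "0.05")}
for q, n in sizes_needed.items():
    print(q, n)
```

For a 2% limit this gives n = 50/q, so the required sample grows quickly as q becomes small.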


There is one other point which needs to be emphasised. We
have seen that when the relationship between y and x is a straight
line passing through the origin, the bias vanishes. This is not so
in the present case, for when y is 1, x is 0 and vice versa, and the
regression line of y on x does not pass through the origin. This
explains the need for a relatively larger value of n as compared
to that in the case of quantitative characters in order that the
ratio of the numbers in the two classes may give an unbiased
estimate.


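To obtain the first approximation to the variance of $R_n$, we
substitute from (107), (108), (109), (110) and (111) in (24), giving

$$V_1(R_n) = \frac{p^2}{q^2} \cdot \frac{N-n}{Nn} \cdot \frac{N}{N-1} \left( \frac{q}{p} + \frac{p}{q} + 2 \right) = \frac{N-n}{N-1} \cdot \frac{p}{nq^3} \qquad (113)$$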


4a.13 Extension to k Classes 


The extension of the previous results when the population is
divided into k classes is straightforward. Consider the ratio
$n_i/n_j$, where $n_i$ is the number in the sample in the i-th class and
$n_j$ in the j-th class, and assume that y takes the value 1 whenever
it falls in class i and 0 elsewhere. Similarly, let x assume the value
1 when it falls in class j and 0 elsewhere. We then have

$$R_n = \frac{n_i}{n_j}, \qquad R_N = \frac{N_i}{N_j} \qquad (114)$$

$$\bar{y}_N = p_i, \qquad \bar{x}_N = p_j \qquad (115)$$

$$S_y^2 = \frac{N}{N-1} \, p_i (1 - p_i) \qquad (116)$$

$$S_x^2 = \frac{N}{N-1} \, p_j (1 - p_j) \qquad (117)$$

and

$$\rho S_x S_y = -\frac{N}{N-1} \, p_i p_j \qquad (118)$$

so that

$$\rho = -\sqrt{\frac{p_i p_j}{(1 - p_i)(1 - p_j)}} \qquad (119)$$

Substituting from the above in (12), we have

$$E(R_n) = \frac{p_i}{p_j} \left\{ 1 + \frac{N-n}{Nn} \left( \frac{N}{N-1} \frac{1 - p_j}{p_j} + \frac{N}{N-1} \right) \right\} = \frac{p_i}{p_j} \left\{ 1 + \frac{N-n}{N-1} \cdot \frac{1}{n p_j} \right\} \qquad (120)$$

It will be seen that the formula is identical with (112) except that
q is now replaced by $p_j$.


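To obtain the variance, we substitute from (115) to (118) in
(24) and obtain

$$\begin{aligned}
V_1(R_n) &= \frac{p_i^2}{p_j^2} \cdot \frac{N-n}{Nn} \left\{ \frac{N}{N-1} \frac{1 - p_i}{p_i} + \frac{N}{N-1} \frac{1 - p_j}{p_j} + \frac{2N}{N-1} \right\} \\
&= \frac{N-n}{N-1} \cdot \frac{1}{n} \cdot \frac{p_i^2}{p_j^2} \left( \frac{1 - p_i}{p_i} + \frac{1 - p_j}{p_j} + 2 \right) \\
&= \frac{N-n}{N-1} \cdot \frac{p_i (p_i + p_j)}{n p_j^3} \qquad (121)
\end{aligned}$$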
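The k-class results can be evaluated numerically. The Python sketch below assumes the first-approximation forms obtained by substituting (115) to (118) in the general formulae, namely a relative bias of (N - n)/((N - 1) n p_j) and a variance of (N - n)/(N - 1) times p_i (p_i + p_j)/(n p_j^3); the function names and the numerical values are hypothetical. With two classes, p_i + p_j = 1 and the variance reduces to the two-class form p/(n q^3) apart from the finite multiplier.

```python
from fractions import Fraction


def class_ratio_relative_bias(p_j, n, N):
    """Approximate relative bias of n_i/n_j: (N - n) / ((N - 1) n p_j)."""
    return Fraction(N - n, N - 1) / (n * Fraction(p_j))


def class_ratio_variance(p_i, p_j, n, N):
    """First approximation to V(n_i/n_j): (N - n)/(N - 1) * p_i (p_i + p_j) / (n p_j^3)."""
    p_i, p_j = Fraction(p_i), Fraction(p_j)
    return Fraction(N - n, N - 1) * p_i * (p_i + p_j) / (n * p_j**3)


# Hypothetical population of 1001 units, sample of 101.
N, n = 1001, 101
var_three_class = class_ratio_variance("0.2", "0.3", n, N)
var_two_class = class_ratio_variance("0.3", "0.7", n, N)
print(var_three_class, float(var_two_class))
```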
B. SAMPLING WITH VARYING PROBABILITIES
OF SELECTION


4b.1 Ratio Estimate and its Variance 


So far we have considered the theory of the ratio method of 
estimation for samples chosen by the method of simple random 
sampling. We shall now give the basic theory of the ratio 
estimate when sampling is carried out with replacement with 
varying probabilities of selection. 


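Let

$$z_i = \frac{y_i}{N P_i} = \bar{y}_N + \varepsilon_i, \qquad v_i = \frac{x_i}{N P_i} = \bar{x}_N + \varepsilon_i'$$

and

$$\bar{z}_n = \frac{1}{n} \sum_{i=1}^{n} z_i, \qquad \bar{v}_n = \frac{1}{n} \sum_{i=1}^{n} v_i, \qquad R_n = \frac{\bar{z}_n}{\bar{v}_n}$$

It is then easily shown that

$$E(z_i) = \bar{y}_N, \qquad E(v_i) = \bar{x}_N$$

$$E(\bar{z}_n) = \bar{y}_N, \qquad E(\bar{v}_n) = \bar{x}_N$$

$$E(\bar{\varepsilon}_n^2) = \frac{\sigma_z^2}{n}, \qquad E(\bar{\varepsilon}_n'^2) = \frac{\sigma_v^2}{n}, \qquad E(\bar{\varepsilon}_n \bar{\varepsilon}_n') = \frac{\rho \sigma_z \sigma_v}{n}$$

On substituting in (9), it follows that

$$E_1(R_n) = R_N \left\{ 1 + \frac{1}{n} \left( \frac{\sigma_v^2}{\bar{x}_N^2} - \frac{\rho \sigma_z \sigma_v}{\bar{x}_N \bar{y}_N} \right) \right\} \qquad (122)$$

Also, from (22), we have

$$V_1(R_n) = R_N^2 \, \frac{1}{n} \left( \frac{\sigma_z^2}{\bar{y}_N^2} + \frac{\sigma_v^2}{\bar{x}_N^2} - \frac{2 \rho \sigma_z \sigma_v}{\bar{x}_N \bar{y}_N} \right) \qquad (123)$$

This can be rewritten as

$$\begin{aligned}
V_1(R_n) &= \frac{1}{n \bar{x}_N^2} (\sigma_z^2 + R_N^2 \sigma_v^2 - 2 R_N \rho \sigma_z \sigma_v) \\
&= \frac{1}{n \bar{x}_N^2} \Big\{ \sum_{i=1}^{N} P_i z_i^2 - \bar{y}_N^2 + R_N^2 \Big( \sum_{i=1}^{N} P_i v_i^2 - \bar{x}_N^2 \Big) - 2 R_N \Big( \sum_{i=1}^{N} P_i z_i v_i - \bar{x}_N \bar{y}_N \Big) \Big\} \\
&= \frac{1}{n \bar{x}_N^2} \sum_{i=1}^{N} P_i (z_i^2 + R_N^2 v_i^2 - 2 R_N z_i v_i) \\
&= \frac{1}{n \bar{x}_N^2} \sum_{i=1}^{N} P_i (z_i - R_N v_i)^2 \qquad (124)
\end{aligned}$$

or

$$V_1(R_n) = \frac{1}{n N^2 \bar{x}_N^2} \sum_{i=1}^{N} \frac{1}{P_i} (y_i - R_N x_i)^2 \qquad (125)$$

It follows that

$$V_1(Y_R) = \frac{N^2}{n} \sum_{i=1}^{N} P_i (z_i - R_N v_i)^2 \qquad (126)$$

or

$$V_1(Y_R) = \frac{1}{n} \sum_{i=1}^{N} \frac{1}{P_i} (y_i - R_N x_i)^2 \qquad (127)$$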
The estimates of the variances of $R_n$ and $Y_R$ are obtained
directly from (40) and (41). We write

$$\text{Est. } V_1(R_n) = \frac{1}{n(n-1) \bar{x}_N^2} \sum_{i=1}^{n} (z_i - R_n v_i)^2 \qquad (128)$$

and

$$\text{Est. } V_1(Y_R) = \frac{N^2}{n(n-1)} \sum_{i=1}^{n} (z_i - R_n v_i)^2 \qquad (129)$$

Finally, we note that when $P_i$ is proportional to $x_i$ so that
$P_i = x_i/N\bar{x}_N$, we have

$$z_i = \frac{y_i}{N P_i} = \frac{y_i}{x_i} \, \bar{x}_N = r_i \bar{x}_N, \qquad v_i = \frac{x_i}{N P_i} = \bar{x}_N, \qquad \bar{v}_n = \bar{x}_N$$

and

$$Y_R = N \bar{z}_n = N \bar{x}_N \bar{r}_n$$

It follows that the sampling theory of the ratio estimate $Y_R$ is
identical with the sampling theory of the estimate $\bar{z}_n$. We there-
fore have from Section 2b.2,

$$E(Y_R) = N E(\bar{z}_n) = N \bar{y}_N$$

showing that the ratio estimate for this case provides an unbiased
estimate of the population total. Further, from (126), we have

$$V(Y_R) = \frac{N^2}{n} \sum_{i=1}^{N} P_i (z_i - \bar{y}_N)^2$$

and, from (129),

$$\text{Est. } V(Y_R) = \frac{N^2}{n(n-1)} \sum_{i=1}^{n} (z_i - \bar{z}_n)^2$$

The method of forecasting acreage of principal crops in India
provides a good example of the application of the theory presented 
in this section. Normally, area figures are collected by the village 
accountant, field by field, for all the villages within his jurisdiction. 
Such complete enumeration is, however, not available in time for 
making pre-harvest forecasts. These are consequently made on 
the basis of advance enumeration of a sample of villages selected 
with probability proportional to the cultivated area (including 
fallows) with which the area under a major crop is known to be 
highly correlated. The ratio method of estimation on the previous 
year’s figures is used. 


Example 4.4 

Table 4.5 shows the total cultivated area during 1931 as also 
the area under wheat in two consecutive years 1936, 1937 for 
a sample of 34 villages in Lucknow subdivision (India). The 
villages were selected with replacement with probability propor- 
tional to the cultivated area (including fallows) as recorded in 
1931. The total cultivated area in 1931 and the total area under 
wheat in 1936 for all the 170 villages in Lucknow subdivision 
were known to be 78019 and 21288 acres respectively. Estimate 
the area under wheat for the subdivision for the year 1937 using 
the ratio method of estimation and calculate the standard error 
of the estimate so made. 


What would be the standard error of the estimate if the 
information for the previous year were not used ? 
TABLE 4.5


Values of Total Cultivated Area and of Area under Wheat
in Two Consecutive Years for a Sample of 34 Villages
in Lucknow Subdivision


Serial No.   Cultivated     Area under Wheat       1000 x_i/a_i       1000 y_i/a_i
of Village   Area 1931      1936        1937       = v_i/0.45894      = z_i/0.45894
             (acres)        (acres)     (acres)
   (1)          (2)           (3)         (4)           (5)                (6)
    1           401            75          52           187                130
    2           634           163         149           257                235
    3          1194           326         289           273                242
    4          1770           442         381           250                215
    5          1060           254         278           240                262
    6           827           125         111           151                134
    7          1737           559         634           322                365
    8          1060           254         278           240                262
    9           360           101         112           281                311
   10           946           359         355           379                375
   11           470           109          99           232                211
   12          1625           481         498           296                306
   13           827           125         111           151                134
   14            96             5           6            52                 63
   15          1304           427         399           327                306
   16           377            78          79           207                210
   17           259            78         105           301                405
   18           186            45          27           242                145
   19          1767           564         515           319                291
   20           604           238         249           394                412
   21           701            92          85           131                121
   22           524           247         221           471                422
   23           571           134         133           235                233
   24           962           131         144           136                150
   25           407           129         103           317                253
   26           715           192         179           269                250
   27           845           663         330           785                391
   28          1016           236         219           232                216
   29           184            73          62           397                337
   30           282            62          79           220                280
   31           194            71          60           366                309
   32           439           137         100           312                228
   33           854           196         141           230                165
   34           824           255         265           309                322

Total                                                  9511               8691
Crude Sum of Squares                                3166531            2505925
Crude Sum of Products                                      2727616

Let $a_i$ denote the cultivated area in the i-th village and $x_i$ and
$y_i$ the areas under wheat for the years 1936 and 1937 respectively.
Then $P_i$ the selection probability for the i-th village is given by

$$P_i = \frac{a_i}{A}, \quad \text{where } A = \sum_{i=1}^{N} a_i = 78019.$$

Also

$$z_i = \frac{y_i}{N P_i} = \frac{A y_i}{N a_i} \quad \text{and} \quad v_i = \frac{x_i}{N P_i} = \frac{A x_i}{N a_i}$$

For the sake of computational convenience $1000 y_i/a_i$ and $1000 x_i/a_i$
have been calculated instead of $z_i$ and $v_i$. These are given in
cols. 6 and 5 respectively of Table 4.5 and denoted by $l_i$ and $l_i'$, where

$$l_i = \frac{1000 y_i}{a_i} = \frac{1000 N}{A} z_i = \frac{z_i}{0.45894}$$

and

$$l_i' = \frac{1000 x_i}{a_i} = \frac{1000 N}{A} v_i = \frac{v_i}{0.45894}$$

Now

$$R_n = \frac{\bar{z}_n}{\bar{v}_n} = \frac{\sum l_i}{\sum l_i'} = \frac{8691}{9511} = 0.9138$$

Hence the estimate of the area under wheat in 1937 is given by

$$Y_R = R_n X = (0.9138)(21288) = 19453 \text{ acres}$$

From Table 4.5,

$$\sum l_i^2 = 2505925, \quad \sum l_i l_i' = 2727616 \quad \text{and} \quad \sum l_i'^2 = 3166531$$

Hence

$$\sum (l_i - R_n l_i')^2 = \sum l_i^2 - 2 R_n \sum l_i l_i' + R_n^2 \sum l_i'^2 = 165082$$

or

$$\sum (z_i - R_n v_i)^2 = (0.45894)^2 (165082) = 34770$$

Hence

$$\text{Est. } V(Y_R) = \frac{N^2}{n(n-1)} \sum (z_i - R_n v_i)^2 = \frac{(170)^2}{(34)(33)} \times 34770 = 895600 \text{ acres}^2$$

$$\text{S.E. } Y_R = \sqrt{895600} = 946 \text{ acres}$$

If the information for the previous year had not been used,
the estimate of the area under wheat in 1937 would have been

$$Y = N \bar{z}_n = (0.45894) \, \frac{N}{n} \sum l_i = \frac{(0.45894)(170)(8691)}{34} = 19943 \text{ acres}$$

and

$$\begin{aligned}
\text{Est. } V(Y) &= \frac{N^2}{n(n-1)} \sum (z_i - \bar{z}_n)^2 \\
&= \frac{N^2}{n(n-1)} \left( \frac{A}{1000 N} \right)^2 \left\{ \sum l_i^2 - \frac{(\sum l_i)^2}{n} \right\} \\
&= \frac{(170)^2}{(34)(33)} (0.45894)^2 (2505925 - 2221573) \\
&= \frac{(170)^2}{(34)(33)} (0.21063)(284352) \\
&= 1542700 \text{ acres}^2
\end{aligned}$$

or

S.E. Y = 1242 acres

The increase in efficiency in using the previous year's infor-
mation

$$= \left( \frac{1542700}{895600} - 1 \right) \times 100 = 72.3\%$$
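The computations of Example 4.4 depend only on the totals, sums of squares and sum of products at the foot of Table 4.5. The Python sketch below retraces them; the variable names are illustrative assumptions. It carries the ratio 8691/9511 at full precision rather than rounded to 0.9138, so intermediate figures differ slightly from those printed in the example.

```python
import math

N, n = 170, 34                 # villages in the subdivision and in the sample
A = 78019                      # total cultivated area, 1931 (acres)
X_total = 21288                # total area under wheat, 1936 (acres)
sum_l_y, sum_l_x = 8691, 9511  # totals of 1000*y/a and 1000*x/a
ss_y, ss_x = 2505925, 3166531  # crude sums of squares
sp_xy = 2727616                # crude sum of products

scale = A / (1000 * N)         # converts l-values to z and v (about 0.45894)
factor = N**2 / (n * (n - 1))

# Ratio estimate using the previous year's wheat area.
R_n = sum_l_y / sum_l_x
Y_ratio = R_n * X_total
resid_ss = ss_y - 2 * R_n * sp_xy + R_n**2 * ss_x
var_ratio = factor * scale**2 * resid_ss
se_ratio = math.sqrt(var_ratio)

# Estimate ignoring the previous year's information.
Y_simple = scale * N * sum_l_y / n
var_simple = factor * scale**2 * (ss_y - sum_l_y**2 / n)
se_simple = math.sqrt(var_simple)

gain_percent = 100 * (var_simple / var_ratio - 1)
print(round(Y_ratio), round(se_ratio), round(Y_simple), round(se_simple),
      round(gain_percent, 1))
```

The standard errors come out at about 946 and 1242 acres, and the gain in efficiency at a little over 72%, in agreement with the example.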
APPENDIX
Expected Values of Certain Higher Order Product Moments


We shall derive expressions for


(i) $E(\bar{\varepsilon}_n \bar{\varepsilon}_n'^2)$, (ii) $E(\bar{\varepsilon}_n \bar{\varepsilon}_n'^3)$ and (iii) $E(\bar{\varepsilon}_n^2 \bar{\varepsilon}_n'^2)$


Let 


А ыы а) 
We may then write 


Lif a.b. 'a ‘p= y а-ға a bor V t 
Z e'eje/^e, = У се Ж ee B — ере 
#22) i j 


a = Мр, st, в — Ми, asg (2) 
50 + 
М b 
А єбє; ee" e/Be,'y 


T 2 eie «/'®є/В (2 е е-е үу — 44?) 
= Миу (Мара, abe, g — Neos, а) 

AN His, ary Ho, в — Масъ, c4 pay) 
N (Ань, Езу Ha, а — Маз», as p+) 
= Мін, abe, вису — М (haso, ави. y 

+ Hate, ағу Ho, в F aae, куйа) 


2n 2Nu. e, [y (3) 
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Armed with these results and the theorem on expectations in 
Section 2a.9, the derivation of the expected values of higher order 
product moments becomes straightforward. 

Thus 


ЕРТЕДЕ 
[GGe деді 


Ж ,E[Z2 «et Ж ауы Паки + 3 ae; «] 
i ij ii ijk 
(4) 


N N 
= = [а Daet e Х (eie? T Јев! ву) 
с i Ai 


N 
tex «4/44 | 
ijk 
= P ЕС + е (— Маз- 2 Мила) + ез QNa3] 


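$$= \frac{1}{n^3} (e_1 - 3e_2 + 2e_3) N \mu_{12} = \frac{(N-n)(N-2n)}{(N-1)(N-2)} \cdot \frac{\mu_{12}}{n^2} \qquad (5)$$

It follows that

$$E(\bar{\varepsilon}_n'^3) = \frac{1}{n^3} (e_1 - 3e_2 + 2e_3) N \mu_{03} = \frac{(N-n)(N-2n)}{(N-1)(N-2)} \cdot \frac{\mu_{03}}{n^2} \qquad (6)$$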
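The closed form for the third-order product moment can be checked by brute force on a small population. The Python sketch below enumerates every sample of size n drawn without replacement and compares the exact expectation of (ybar_n - ybar_N)(xbar_n - xbar_N)^2 with (N - n)(N - 2n) / ((N - 1)(N - 2)) times mu_12 / n^2, taking mu_12 as the population mean of (y - ybar_N)(x - xbar_N)^2. That reading of mu_12, the function names and the small data set are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def exact_moment(x, y, n):
    """Exact E[(ybar_n - ybar_N)(xbar_n - xbar_N)^2] over all samples of size n."""
    N = len(x)
    xbar, ybar = Fraction(sum(x), N), Fraction(sum(y), N)
    samples = list(combinations(range(N), n))
    total = Fraction(0)
    for s in samples:
        dev_y = Fraction(sum(y[i] for i in s), n) - ybar
        dev_x = Fraction(sum(x[i] for i in s), n) - xbar
        total += dev_y * dev_x**2
    return total / len(samples)


def formula_moment(x, y, n):
    """(N - n)(N - 2n) / ((N - 1)(N - 2)) * mu_12 / n^2."""
    N = len(x)
    xbar, ybar = Fraction(sum(x), N), Fraction(sum(y), N)
    mu_12 = sum((yi - ybar) * (xi - xbar)**2 for xi, yi in zip(x, y)) / N
    return Fraction((N - n) * (N - 2 * n), (N - 1) * (N - 2)) * mu_12 / n**2


x = [1, 2, 4, 7, 11, 3]   # hypothetical population values
y = [2, 1, 5, 6, 9, 4]
for n in (2, 3, 4):
    print(n, exact_moment(x, y, n), formula_moment(x, y, n))
```

For n = N/2 both sides vanish, since each sample and its complement have equal and opposite deviations.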


Similarly 
E (2,6,?) = E ën’) (6,6, 1 


Substituting from (4), we write 


1 n ; n ä n б п TNT 
E(ng?) = WE (е «) (Sac + 3 ast +23 «че 


464 9 
ЖУ аве 
[d 
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1 Р, 3 Ie ia 
= e E ГЕ є;є;'3 + 2 (66% + Зе Зе + 3ee;'e 2) 


п n 
+ XN 3 (ee ee,’ + 6626) + X € «9 | 
ізбізек ізбізерәбі 


= 1 [o2 2 єє + ё. 2 (ag ete] Mein! ej? 


nt 


N N 
Hes X (3ве/еуе, -Зее у ер) Ке, X сву вв | 
66 16) ul 


= А (42 2: «e? + e 2 («гє ®-Е Зе "e, +3 ee; 6") 


m 
У в 
+ ез ве е, + Зее, 
c j €k 167 ey) 
N 


N 
Ва 2 age (2 q = «€ — 4 — «)] 
у УЯ 1 


nt 


1 N N 26. 
= = (в D eici + e У (eej H 3eie;? ej --3e;e' ej?) 
i 17) 


+ es » „Эче ej e + Зе Зеу) 
1) 


N 

TON mur 

—e, X (аве, 4 2 «)| 
ijk 


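Using the results (1), (2) and (3) above, we get

$$\begin{aligned}
E(\bar{\varepsilon}_n \bar{\varepsilon}_n'^3) &= \frac{1}{n^4} \big[ e_1 N \mu_{13} + e_2 \{ -N\mu_{13} - 3N\mu_{13} + 3(N^2 \mu_{11}\mu_{02} - N\mu_{13}) \} \\
&\qquad + 3e_3 (4N\mu_{13} - 2N^2 \mu_{11}\mu_{02}) - e_4 (6N\mu_{13} - 3N^2 \mu_{11}\mu_{02}) \big] \\
&= \frac{1}{n^4} \big[ (e_1 - 7e_2 + 12e_3 - 6e_4) N\mu_{13} + 3(e_2 - 2e_3 + e_4) N^2 \mu_{11}\mu_{02} \big] \qquad (7) \\
&= \frac{N-n}{n^3 (N-1)(N-2)(N-3)} \big[ (N^2 + N - 6nN + 6n^2) \mu_{13} + 3N(n-1)(N-n-1) \mu_{11}\mu_{02} \big] \qquad (8)
\end{aligned}$$

It follows that

$$E(\bar{\varepsilon}_n'^4) = \frac{1}{n^4} \big[ (e_1 - 7e_2 + 12e_3 - 6e_4) N\mu_{04} + 3(e_2 - 2e_3 + e_4) N^2 \mu_{02}^2 \big] \qquad (9)$$

or

$$E(\bar{\varepsilon}_n'^4) = \frac{N-n}{n^3 (N-1)(N-2)(N-3)} \big[ (N^2 + N - 6nN + 6n^2) \mu_{04} + 3N(n-1)(N-n-1) \mu_{02}^2 \big] \qquad (10)$$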


Е (е,е,°) = E E [(2 “) {2 ee; + 2 ee" 4-2 в: еу) 


“Г b: «s«']] 
#23726 
= a E [2 es? + 2 Caga? + efe? + 2e?«'ej 
n i ize. 
d , , 
ев & + DY 
4-2ве еу) M ere „а 6; SA 


/. ТЕ 
(еее) t Deeper e + єє € + 26,66 «) | 


\ 


k [es + ез (N*H20H02 + 2А — ТМиа) 


nt 


— ез (QN*uaoltos + ЧУ m? — 12 Ми) 


N . 
—e X (өй -2«9«/«)] 


ізеізек 


т 


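$$\begin{aligned}
&= \frac{1}{n^4} \big[ e_1 N\mu_{22} + e_2 (N^2 \mu_{20}\mu_{02} + 2N^2 \mu_{11}^2 - 7N\mu_{22}) \\
&\qquad - e_3 (2N^2 \mu_{20}\mu_{02} + 4N^2 \mu_{11}^2 - 12N\mu_{22}) + e_4 (N^2 \mu_{20}\mu_{02} + 2N^2 \mu_{11}^2 - 6N\mu_{22}) \big] \\
&= \frac{1}{n^4} \big[ (e_1 - 7e_2 + 12e_3 - 6e_4) N\mu_{22} + (e_2 - 2e_3 + e_4) N^2 (\mu_{20}\mu_{02} + 2\mu_{11}^2) \big] \qquad (11) \\
&= \frac{N-n}{n^3 (N-1)(N-2)(N-3)} \big[ (N^2 + N - 6nN + 6n^2) \mu_{22} + N(n-1)(N-n-1)(\mu_{20}\mu_{02} + 2\mu_{11}^2) \big] \qquad (12)
\end{aligned}$$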


CHAPTER V 
REGRESSION METHOD OF ESTIMATION 


5.1 Simple Regression 


In this chapter, we shall consider the regression method of
estimating the population total (or mean) of the character y
under study. Suppose, as previously, that the population is
divided into k classes with, say, $N_i$ units having the value $x_i$
each (i = 1, 2, ..., k), and that a simple random sample of n
is drawn from the population N. Suppose, further, that in
repeated samples of n, the number of units having the value $x_i$
is fixed, say given by $n_i$ (i = 1, 2, ..., k). In simple regression
we postulate a procedure of sampling in which $n_1, n_2, ..., n_k$
units are drawn from their respective classes with replace-
ment. Further, we assume that the mean value of y for a given
x is linear in x and V (y | x) is constant. In other words, y is
of the form

$$y_{ij} = \alpha + \beta x_i + \varepsilon_{ij} \qquad (1)$$

where

$$E(\varepsilon_{ij} \mid i) = 0$$

and

$$E(\varepsilon_{ij}^2 \mid i) = \text{constant, say, } \gamma \qquad (2)$$

Summing up (1) over all the N units in the population, we have

$$Y = N\alpha + \beta N \bar{x}_N \qquad (3)$$

When $\alpha$ and $\beta$ are estimated from the sample and the population
total of x is known, the right-hand side of (3) provides an
estimate of the total. This estimate is known as the simple
regression estimate. To distinguish it from the estimate of the
population total based on the ratio and the simple mean methods,
it will be denoted by $Y_l$ and the mean by $\bar{y}_l$.
5.2 Simple Regression Estimate and its Variance


We have seen in Section 2a.3 that the best unbiased linear
estimate and its variance are given by the Markoff method of
estimation. Suppose that $Y_l$ is given by

$$Y_l = \sum_{i=1}^{k} n_i \lambda_i \bar{y}_{n_i} \qquad (4)$$

where $\lambda_i$'s are to be chosen so as to satisfy the conditions of the
Markoff Theorem. The condition that $Y_l$ should be an unbiased
estimate of Y gives

$$E\Big( \sum_{i=1}^{k} n_i \lambda_i \bar{y}_{n_i} \Big) = \sum_{i=1}^{k} N_i \bar{y}_{N_i}$$

On substituting from (1), we obtain

$$\sum_{i=1}^{k} n_i \lambda_i (\alpha + \beta x_i) = \sum_{i=1}^{k} N_i (\alpha + \beta x_i)$$

which can be written as

$$\sum_{i=1}^{k} (n_i \lambda_i - N_i)(\alpha + \beta x_i) = 0 \qquad (5)$$

The second condition of the Markoff method of estimation is
that the variance of the estimate should be minimum. Now the
variance of $Y_l$ when $n_1, n_2, ..., n_k$ are fixed and sampling is
carried out with replacement with constant variance $\gamma$ in each
class, is given by

$$V(Y_l \mid n_1, n_2, ..., n_k) = \sum_{i=1}^{k} n_i \lambda_i^2 \sigma_i^2$$

where

$$\sigma_i^2 = \frac{\sum_{j=1}^{N_i} (y_{ij} - \bar{y}_{N_i})^2}{N_i} = \gamma \qquad (6)$$

Hence

$$V(Y_l \mid n_1, n_2, ..., n_k) = \gamma \sum_{i=1}^{k} n_i \lambda_i^2 \qquad (7)$$
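To minimize (7) subject to the condition (5), we shall use
the Lagrangian method of multipliers and form a function $\phi$
given by

$$\phi = \gamma \sum_{i=1}^{k} n_i \lambda_i^2 - \mu \sum_{i=1}^{k} (n_i \lambda_i - N_i)(\alpha + \beta x_i) \qquad (8)$$

where $\mu$ is the Lagrangian constant.

On differentiating $\phi$ with respect to $\lambda_i$, $\alpha$ and $\beta$ and equating
to zero, we obtain

$$\frac{\partial \phi}{\partial \lambda_i} = 2\gamma n_i \lambda_i - \mu n_i (\alpha + \beta x_i) = 0 \quad (i = 1, 2, ..., k) \qquad (9)$$

$$\frac{\partial \phi}{\partial \alpha} = -\mu \sum_{i=1}^{k} (n_i \lambda_i - N_i) = 0 \qquad (10)$$

and

$$\frac{\partial \phi}{\partial \beta} = -\mu \sum_{i=1}^{k} x_i (n_i \lambda_i - N_i) = 0 \qquad (11)$$

From (9) we have

$$\lambda_i = \frac{\mu}{2\gamma} (\alpha + \beta x_i) = \alpha' + \beta' x_i \qquad (12)$$

where

$$\alpha' = \frac{\mu \alpha}{2\gamma} \quad \text{and} \quad \beta' = \frac{\mu \beta}{2\gamma} \qquad (13)$$

Substituting from (12) for $\lambda_i$'s in (10) and (11), we obtain

$$\sum_{i=1}^{k} n_i (\alpha' + \beta' x_i) = N \qquad (14)$$

and

$$\alpha' \sum_{i=1}^{k} n_i x_i + \beta' \sum_{i=1}^{k} n_i x_i^2 = N \bar{x}_N \qquad (15)$$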
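Solving for $\alpha'$ and $\beta'$, we get

$$\beta' = \frac{N (\bar{x}_N - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \qquad (16)$$

and

$$\alpha' = N \left[ \frac{1}{n} - \frac{\bar{x}_n (\bar{x}_N - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \right] \qquad (17)$$

On substituting for $\beta'$ and $\alpha'$ from (16) and (17) in (12), we
obtain

$$\lambda_i = N \left[ \frac{1}{n} + \frac{(\bar{x}_N - \bar{x}_n)(x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \right] \qquad (18)$$

Hence, from (4),

$$Y_l = N \left[ \bar{y}_n + \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \, (\bar{x}_N - \bar{x}_n) \right] \qquad (19)$$

Estimating Y from (3) by

$$N\hat{\alpha} + N \bar{x}_N \hat{\beta}$$

and equating with the right-hand side of (19), we have

$$\hat{\alpha} = \bar{y}_n - \hat{\beta} \bar{x}_n \qquad (20)$$

and

$$\hat{\beta} = \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \qquad (21)$$

To obtain the variance of $Y_l$ we substitute for $\lambda_i$ from (18)
in (7). We then have

$$\begin{aligned}
V(Y_l) &= \gamma \sum_{i=1}^{k} n_i \lambda_i^2 \\
&= N^2 \gamma \sum_{i=1}^{k} n_i \left[ \frac{1}{n^2} + \frac{2 (\bar{x}_N - \bar{x}_n)(x_i - \bar{x}_n)}{n \sum n_i (x_i - \bar{x}_n)^2} + \frac{(\bar{x}_N - \bar{x}_n)^2 (x_i - \bar{x}_n)^2}{\{\sum n_i (x_i - \bar{x}_n)^2\}^2} \right]
\end{aligned}$$

Clearly, the middle term is zero, giving us

$$V(Y_l) = N^2 \gamma \left[ \frac{1}{n} + \frac{(\bar{x}_n - \bar{x}_N)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \right] \qquad (22)$$

which can also be written as

$$V(Y_l) = \frac{N^2 \gamma}{n} \left\{ 1 + \frac{(m_1 - \mu_1)^2}{m_2} \right\} \qquad (23)$$

where $m_1$ is the sample mean $\bar{x}_n$, $m_2$ denotes the second moment
of the sample about its mean and $\mu_1$ the population mean
of x.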
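Pooling (6) over k classes gives

$$\gamma = \frac{\sum_{i=1}^{k} \sum_{j=1}^{N_i} (y_{ij} - \alpha - \beta x_i)^2}{\sum_{i=1}^{k} N_i} \qquad (24)$$

From (20) and (21), we have the identities:

$$\alpha = \bar{y}_N - \beta \bar{x}_N \qquad (25)$$

and

$$\beta = \frac{\sum_{i=1}^{k} \sum_{j=1}^{N_i} y_{ij} (x_i - \bar{x}_N)}{\sum_{i=1}^{k} N_i (x_i - \bar{x}_N)^2}$$

or

$$\beta = \frac{\sum y (x - \bar{x}_N)}{\sum (x - \bar{x}_N)^2}$$

where the summations in both numerator and denominator are
carried out over all the N pairs $(x_i, y_{ij})$

$$= \rho \, \frac{\sigma_y}{\sigma_x}, \text{ say} \qquad (26)$$

On substituting for $\alpha$ and $\beta$ from (25) and (26) in (24), we get

$$\gamma = \frac{1}{N} \sum \left\{ y - \bar{y}_N - \rho \, \frac{\sigma_y}{\sigma_x} (x - \bar{x}_N) \right\}^2 = \sigma_y^2 (1 - \rho^2) \qquad (27)$$

We may, therefore, write (22) as

$$V(Y_l \mid n_1, n_2, ..., n_k) = N^2 \sigma_y^2 (1 - \rho^2) \left[ \frac{1}{n} + \frac{(\bar{x}_n - \bar{x}_N)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \right] \qquad (28)$$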

It will be noticed that the variance consists of two terms.
The first term represents $N^2$ times the residual variance of the
mean of a simple random sample of n, estimated from the
regression line, when $\beta$ is known; while the second represents
the increase in the variance of the estimate when $\beta$ is determined
from the sample. As will be shown in Section 5.4, the latter
contribution is of an order $1/n^2$, so that if n is large, the variance
of the regression estimate may be regarded as being given by the
first term only. It should be pointed out, however, that the
sampling is assumed to be carried out with replacement. For
a simple random sample drawn without replacement, exact results
are not available but it is surmised that the effect will be approxi-
mately to multiply the above expression by the finite multiplier
(N - n)/N.
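The structure of the simple regression estimate can be illustrated with a short Python sketch. It computes the least-squares slope and intercept from the sample, forms the estimate N[ybar_n + b(xbar_N - xbar_n)], and also builds the Markoff weights N[1/n + (xbar_N - xbar_n)(x - xbar_n)/sum(x - xbar_n)^2] attached to each sampled unit. The data, the population size and the population mean of x are hypothetical, and the function name is an assumption; exact fractions are used so that the identities hold without rounding error.

```python
from fractions import Fraction


def regression_estimate(x, y, N, xbar_N):
    """Return (alpha_hat, beta_hat, estimated total, per-unit Markoff weights)."""
    x = [Fraction(v) for v in x]
    y = [Fraction(v) for v in y]
    xbar_N = Fraction(xbar_N)
    n = len(x)
    xbar_n, ybar_n = sum(x) / n, sum(y) / n
    ssx = sum((xi - xbar_n)**2 for xi in x)
    beta = sum(yi * (xi - xbar_n) for xi, yi in zip(x, y)) / ssx
    alpha = ybar_n - beta * xbar_n
    total = N * (ybar_n + beta * (xbar_N - xbar_n))
    weights = [N * (Fraction(1, n) + (xbar_N - xbar_n) * (xi - xbar_n) / ssx)
               for xi in x]
    return alpha, beta, total, weights


# Hypothetical sample of 5 units from a population of 40 with mean of x equal to 4.
x = [1, 2, 3, 5, 9]
y = [3, 4, 8, 9, 16]
N, xbar_N = 40, 4
alpha, beta, total, weights = regression_estimate(x, y, N, xbar_N)
print(float(alpha), float(beta), float(total))
```

The weights satisfy the two unbiasedness conditions (they sum to N, and their weighted sum of x equals the population total of x), and the weighted sum of y reproduces the regression estimate.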


5.3 Estimation of the Variance of Simple Regression Estimate 


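To evaluate (22) we require the estimate of $\gamma$. A straightforward
method of finding this is to substitute for $\alpha$ and $\beta$ their sample
estimates in (24) and calculate the expectation. Consider then
a quantity Q given by

$$Q = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \{ y_{ij} - \bar{y}_n - \hat{\beta} (x_i - \bar{x}_n) \}^2 \qquad (29)$$

and calculate its conditional expectation for fixed $n_1, n_2, ..., n_k$.
This is best done by expressing Q as a function of the $\varepsilon$'s, which
are defined by (1).

From (1) we have

$$\bar{y}_{n_i} = \alpha + \beta x_i + \bar{\varepsilon}_{n_i} \qquad (30)$$

whence

$$\sum_{i=1}^{k} n_i \bar{y}_{n_i} = n\alpha + \beta n \bar{x}_n + \sum_{i=1}^{k} n_i \bar{\varepsilon}_{n_i}$$

i.e.

$$\bar{y}_n = \alpha + \beta \bar{x}_n + \frac{\sum_{i=1}^{k} n_i \bar{\varepsilon}_{n_i}}{n} \qquad (31)$$

$$\begin{aligned}
\hat{\beta} &= \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \\
&= \frac{\sum_{i=1}^{k} n_i (\alpha + \beta x_i + \bar{\varepsilon}_{n_i})(x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \\
&= \beta + \frac{\sum_{i=1}^{k} n_i \bar{\varepsilon}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \qquad (32)
\end{aligned}$$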


айы! s | 
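On substituting in (29) from (1), (31) and (32), we get

$$Q = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \Big\{ \varepsilon_{ij} - \bar{\varepsilon}_n - (x_i - \bar{x}_n) \frac{\sum_{t=1}^{k} n_t (x_t - \bar{x}_n) \bar{\varepsilon}_{n_t}}{\sum_{t=1}^{k} n_t (x_t - \bar{x}_n)^2} \Big\}^2 \tag{33}$$

Setting $\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2 = n m_2$, and expanding the right-hand side, we obtain

$$Q = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\varepsilon_{ij} - \bar{\varepsilon}_n)^2 - \frac{1}{n m_2} \Big\{ \sum_{t=1}^{k} n_t (x_t - \bar{x}_n) \bar{\varepsilon}_{n_t} \Big\}^2$$

$$= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \varepsilon_{ij}^2 - \frac{1}{n} \Big[ \sum_{i=1}^{k} n_i^2 \bar{\varepsilon}_{n_i}^2 + \sum_{i \neq l} n_i n_l \bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l} \Big]$$

$$\quad - \frac{1}{n m_2} \Big[ \sum_{i=1}^{k} n_i^2 \bar{\varepsilon}_{n_i}^2 (x_i - \bar{x}_n)^2 + \sum_{i \neq l} n_i n_l \bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l} (x_i - \bar{x}_n)(x_l - \bar{x}_n) \Big] \tag{34}$$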
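Taking expectations of both sides in (34) and noting that

$$E(\varepsilon_{ij}^2) = \gamma, \qquad E(\bar{\varepsilon}_{n_i}^2) = \frac{\gamma}{n_i}$$

and

$$E(\bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l}) = 0$$

we obtain

$$E(Q) = (n - 2)\gamma \tag{35}$$

It follows that

$$\text{Est. } \gamma = \frac{Q}{n - 2} \tag{36}$$

On substituting in (22), we obtain

$$\text{Est. } V(\hat{Y}_l \mid n_1, n_2, \ldots, n_k) = \frac{N^2 Q}{n - 2} \Big[ \frac{1}{n} + \frac{(\bar{x}_n - \bar{x}_N)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] \tag{37}$$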
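The result (35) can be checked numerically. The sketch below is an illustration only: it assumes normally distributed residuals with constant variance (normality is not required by the argument above), and the class values, parameter values and function names are hypothetical.

```python
import random


def residual_ss(xs, ys):
    """Q of eq. (29): sum of squared residuals about the fitted line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx
    return sum((y - ybar - b * (x - xbar)) ** 2 for x, y in zip(xs, ys))


def mean_gamma_estimate(xs, alpha, beta, gamma, reps, seed=0):
    """Average of Q / (n - 2) over repeated samples with the x's held fixed."""
    rng = random.Random(seed)
    sd = gamma ** 0.5
    total = 0.0
    for _ in range(reps):
        ys = [alpha + beta * x + rng.gauss(0.0, sd) for x in xs]
        total += residual_ss(xs, ys) / (len(xs) - 2)
    return total / reps


# Hypothetical sample: k = 4 classes with n_i = 4, 6, 5, 5 (n = 20).
xs = [1.0] * 4 + [2.0] * 6 + [4.0] * 5 + [7.0] * 5
estimate = mean_gamma_estimate(xs, alpha=3.0, beta=1.5, gamma=2.0, reps=4000)
```

With the settings above the average of Q/(n - 2) should lie close to the assumed value of gamma, in line with (36).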


5.4 Expected Value of the Sampling Variance of Simple Regression Estimate

The expression for the sampling variance given in (23) or (28) depends upon the x's that have turned up in the sample, and so cannot be used for purposes like comparing the precision of the regression estimate with other estimates. We therefore proceed to obtain the expected value of the variance over all samples of size $n$.

Since the first term in (23) is independent of the x's, the problem reduces to that of determining the expected value of

$$\frac{(m_1 - \mu_1)^2}{m_2}$$

Without loss of generality we may assume $\mu_1$ to be zero.

Obviously, the expected value of $m_1^2/m_2$ can be determined only approximately, since both the numerator and the denominator are random variables. Following Section 4a.3, we may write

$$\frac{m_1^2}{m_2} = \frac{m_1^2}{\mu_2} \Big\{ 1 - \frac{m_2 - \mu_2}{\mu_2} + \frac{(m_2 - \mu_2)^2}{\mu_2^2} - \ldots \Big\} \tag{38}$$

where $\mu_2$ denotes the second moment of $x$ in the population. The expected value of the terms in $m_1^2 (m_2 - \mu_2)^3$ and higher powers will be of an order smaller than $1/n^2$, and can be neglected if $n$ is assumed to be reasonably large. We, therefore, write to terms of order $1/n^2$

$$E\Big(\frac{m_1^2}{m_2}\Big) = \frac{E(m_1^2)}{\mu_2} - \frac{E\{m_1^2 (m_2 - \mu_2)\}}{\mu_2^2} + \frac{E\{m_1^2 (m_2 - \mu_2)^2\}}{\mu_2^3} \tag{39}$$

The values of the expressions on the right-hand side have been tabulated for ready application (Sukhatme, 1944). A reference to these formulæ gives


m nt 


Е (mem) = Ма Б ағ 20 _ 94 + =) 


„ Г(е; — es) (ез -- Зе; + e2] 
y2,, 2 (MR: — - € 2 
+ Ning [ n 3 n* 

where 


а a(n —1)(n—2)...(n—j +1) 
ООЛО 2)... —7 +1) 


Neglecting terms of order smaller than $1/n^2$ and retaining terms in $1/N$ and $1/N^2$ for completeness, we obtain


1 3 2 
CE (ning = на Ga TAN ла) 
1 4 1 10 6 
С a x б. 
+ ва 6 "NC aN м) (4%) 
Similarly, on neglecting terms involving $1/n^3$ and higher powers of $1/n$ in the formula for $E(m_1^2 m_2^2)$ as tabulated by the author (1944), we have
sae 3 Se 225 aa ee 
E (тт?) = papa 3 -NF м) + на (5 SAN 245 =) 


Substituting in (39) from (40) and (41), and using the known 
result that 


год =» (=) 


we have 


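where

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \quad \text{and} \quad \beta_2 = \frac{\mu_4}{\mu_2^2}$$

If the x's can be assumed to be normally distributed, so that $\beta_1 = 0$ and $\beta_2 = 3$, we can write the expression for $E(m_1^2/m_2)$ in the exact form as

$$E\Big(\frac{m_1^2}{m_2}\Big) = \frac{1}{n - 3} \tag{43}$$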


The variance of $\hat{Y}_l$ to terms in $1/n^2$ is thus approximated by

$$V(\hat{Y}_l) \cong \frac{N^2 \sigma_y^2 (1 - \rho^2)}{n} \Big( 1 + \frac{1}{n} \Big) \tag{44}$$

and to terms in $1/n$ by simply

$$V(\hat{Y}_l) \cong \frac{N^2 \sigma_y^2 (1 - \rho^2)}{n} \tag{45}$$

To obtain the variance of the mean $\bar{y}_l$, we have only to divide the expression on the right-hand side of (44) or (45) by $N^2$.
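For normally distributed x's sampled with replacement from a large population, the ratio of the squared sample mean to the sample second moment behaves as a ratio of independent chi-square variates, so its expected value is 1/(n - 3). The following is a minimal simulation sketch of that fact; the sample size, number of repetitions and function names are illustrative assumptions.

```python
import random


def ratio_m1sq_m2(xs, mu1):
    """(m1 - mu1)^2 / m2, with m2 the sample second moment about the mean (divisor n)."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum((x - m1) ** 2 for x in xs) / n
    return (m1 - mu1) ** 2 / m2


def simulate_expected_ratio(n, reps, seed=1):
    """Monte Carlo average of m1^2 / m2 for standard normal x's."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += ratio_m1sq_m2(xs, 0.0)
    return total / reps


n = 20
simulated = simulate_expected_ratio(n, reps=20000)
exact_normal = 1.0 / (n - 3)     # about 0.0588, against the leading term 1/n = 0.05
```

The simulated average should fall close to 1/(n - 3) and visibly above 1/n, which illustrates why the correction matters only at order 1/n^2 in the variance of the estimate.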
5.5 Weighted Regression


In developing the preceding theory for simple regression, we have assumed that: (a) the regression is linear, (b) the deviation from the regression line has a constant variance, and (c) sampling within a class is carried out with replacement. These assumptions may not hold in actual practice. In this section, we shall extend the theory to the case where the relationship is linear, the deviation from the regression in any class has a known variance and sampling within a class is carried out without replacement. The extension is due to Hasel (1942) and Cochran (1942).


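Let $\hat{Y}_w$ denote the weighted linear regression estimate of $Y$ and suppose that it is represented by the following linear function of observations:

$$\hat{Y}_w = \sum_{i=1}^{k} n_i \lambda_i \bar{y}_{n_i} \tag{46}$$

where the $\lambda_i$'s $(i = 1, 2, \ldots, k)$ are to be so determined that: (i) $\hat{Y}_w$ is an unbiased estimate of $Y$, and (ii) its variance is minimum. Now, the first condition gives

$$E(\hat{Y}_w) = Y$$

i.e.,

$$\sum_{i=1}^{k} n_i \lambda_i (\alpha + \beta x_i) = \sum_{i=1}^{k} N_i (\alpha + \beta x_i)$$

or

$$\sum_{i=1}^{k} (n_i \lambda_i - N_i)(\alpha + \beta x_i) = 0 \tag{47}$$

The variance of $\hat{Y}_w$ for fixed x's is clearly given by

$$V(\hat{Y}_w \mid n_1, n_2, \ldots, n_k) = \sum_{i=1}^{k} n_i^2 \lambda_i^2 \, \frac{N_i - n_i}{N_i} \, \frac{S_i^2}{n_i} = \sum_{i=1}^{k} \frac{n_i^2 \lambda_i^2}{w_i} \tag{48}$$

where

$$w_i = \frac{n_i N_i}{S_i^2 (N_i - n_i)} \tag{49}$$

and

$$S_i^2 = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (y_{ij} - \bar{y}_{N_i})^2$$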


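To minimize (48) subject to condition (47), which must hold for all values of $\alpha$ and $\beta$, we form the function $\phi$ given by

$$\phi = \sum_{i=1}^{k} \frac{n_i^2 \lambda_i^2}{w_i} - 2 \sum_{i=1}^{k} (n_i \lambda_i - N_i)(\alpha' + \beta' x_i) \tag{50}$$

On differentiating $\phi$ with respect to $\lambda_i$, $\alpha'$ and $\beta'$ and equating to zero, we obtain

$$\frac{1}{2} \frac{\partial \phi}{\partial \lambda_i} = \frac{n_i^2 \lambda_i}{w_i} - n_i (\alpha' + \beta' x_i) = 0 \qquad (i = 1, 2, \ldots, k) \tag{51}$$

whence

$$n_i \lambda_i = w_i (\alpha' + \beta' x_i) \tag{52}$$

$$\sum_{i=1}^{k} (n_i \lambda_i - N_i) = 0 \tag{53}$$

and

$$\sum_{i=1}^{k} x_i (n_i \lambda_i - N_i) = 0 \tag{54}$$

Substituting for $n_i \lambda_i$ from (52) in (53) and (54), we obtain

$$\alpha' W + \beta' W \bar{x}_w = N \tag{55}$$

and

$$\alpha' W \bar{x}_w + \beta' \sum_{i=1}^{k} w_i x_i^2 = N \bar{x}_N \tag{56}$$

where

$$W = \sum_{i=1}^{k} w_i \tag{57}$$

and

$$\bar{x}_w = \frac{\sum_{i=1}^{k} w_i x_i}{W} \tag{58}$$

Solving (55) and (56) for $\alpha'$ and $\beta'$, we get

$$\beta' = \frac{N (\bar{x}_N - \bar{x}_w)}{\sum w_i x_i^2 - W \bar{x}_w^2} = \frac{N (\bar{x}_N - \bar{x}_w)}{W s_{wx}^2} \tag{59}$$

and

$$\alpha' = \frac{N}{W} - \frac{N \bar{x}_w (\bar{x}_N - \bar{x}_w)}{W s_{wx}^2} \tag{60}$$

where

$$s_{wx}^2 = \frac{\sum_{i=1}^{k} w_i x_i^2 - W \bar{x}_w^2}{W} \tag{61}$$

On substituting for $\alpha'$ and $\beta'$ in (52), we obtain

$$n_i \lambda_i = \frac{N w_i}{W} \Big[ 1 + \frac{(x_i - \bar{x}_w)(\bar{x}_N - \bar{x}_w)}{s_{wx}^2} \Big] \tag{62}$$

Hence, from (46), we have

$$\hat{Y}_w = \frac{N}{W} \sum_{i=1}^{k} w_i \bar{y}_{n_i} \Big[ 1 + \frac{(x_i - \bar{x}_w)(\bar{x}_N - \bar{x}_w)}{s_{wx}^2} \Big]$$

which may be written as

$$\hat{Y}_w = N \{ \bar{y}_w + \hat{\beta} (\bar{x}_N - \bar{x}_w) \} \tag{63}$$

where

$$\bar{y}_w = \frac{1}{W} \sum_{i=1}^{k} w_i \bar{y}_{n_i} \tag{64}$$

and

$$\hat{\beta} = \frac{\sum_{i=1}^{k} w_i \bar{y}_{n_i} (x_i - \bar{x}_w)}{\sum_{i=1}^{k} w_i (x_i - \bar{x}_w)^2} \tag{65}$$

To obtain the variance of $\hat{Y}_w$, we substitute for the $\lambda_i$'s from (62) in (48) and obtain

$$V(\hat{Y}_w) = \frac{N^2}{W} \Big[ 1 + \frac{(\bar{x}_N - \bar{x}_w)^2}{s_{wx}^2} \Big] \tag{66}$$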
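The weights in (62) can be checked against the two side conditions implied by (47) and against the closed form (66). The sketch below does this for arbitrary inputs; the function name and the numerical values are hypothetical, and the weights w_i are simply taken as given.

```python
def weighted_regression_weights(N_i, n_i, x_i, w_i):
    """Return lambda_i of eq. (62), the variance via (48) and via (66)."""
    N = sum(N_i)
    W = sum(w_i)
    xbar_N = sum(Nc * x for Nc, x in zip(N_i, x_i)) / N
    xbar_w = sum(w * x for w, x in zip(w_i, x_i)) / W
    s2_wx = sum(w * (x - xbar_w) ** 2 for w, x in zip(w_i, x_i)) / W
    # n_i * lambda_i from eq. (62)
    n_lam = [N * w / W * (1 + (x - xbar_w) * (xbar_N - xbar_w) / s2_wx)
             for w, x in zip(w_i, x_i)]
    lam = [v / n for v, n in zip(n_lam, n_i)]
    var_from_48 = sum(v ** 2 / w for v, w in zip(n_lam, w_i))
    var_from_66 = N ** 2 / W * (1 + (xbar_N - xbar_w) ** 2 / s2_wx)
    return lam, var_from_48, var_from_66


# Hypothetical inputs, for illustration only.
N_i = [11, 48, 84]
n_i = [2, 5, 18]
x_i = [63.73, 155.33, 245.68]
w_i = [0.5, 0.4, 1.2]
lam, v48, v66 = weighted_regression_weights(N_i, n_i, x_i, w_i)
```

Because only ratios of the w_i enter (62), multiplying every w_i by the same constant leaves the lambda_i unchanged, while the variance scales inversely with that constant.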

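It can be verified that when sampling within a class is carried out with replacement and $V(y \mid x)$ is a constant, formulæ (63) and (66) reduce to those appropriate for the simple regression estimate. For, we have in this case

$$w_i = \frac{n_i}{\gamma}, \qquad W = \frac{n}{\gamma}, \qquad \bar{x}_w = \bar{x}_n, \qquad \bar{y}_w = \bar{y}_n$$

and

$$\hat{\beta} = \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} = b$$

and on substitution we find that (63) and (66) reduce to (19) and (22) respectively.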

An examination of the expression for $\hat{Y}_w$ in (63) shows that a knowledge of the true weights $w_i$ is not necessary for calculating $\hat{Y}_w$; numbers proportional to $w_i$ are sufficient for the purpose. The variance of $\hat{Y}_w$, on the other hand, requires a knowledge of the true weights $w_i$. This raises a practical difficulty, because the true weights $w_i$ are rarely known. Often, however, the relationship between the variance of $y$ for a given $x$, and $x$ can be guessed, and numbers proportional to $w_i$ can be obtained. It is, therefore, important to investigate the form which the variance of $\hat{Y}_w$ takes for certain well-known relationships between $V(y \mid x)$ and $x$. We shall consider only the simplest situation where $V(y \mid x)$ is proportional to $x$.

Let

$$V(y \mid x) = \gamma x \tag{67}$$
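It follows then from (49) that

$$w_i = \frac{1}{V(\bar{y}_{n_i})} = \frac{n_i}{\gamma x_i} \cdot \frac{N_i - 1}{N_i - n_i} = \frac{w_i'}{\gamma} \tag{68}$$

where

$$w_i' = \frac{n_i (N_i - 1)}{(N_i - n_i) x_i} \tag{69}$$

and

$$W = \frac{W'}{\gamma}, \qquad W' = \sum_{i=1}^{k} w_i' \tag{70}$$

On substituting in (66), we have

$$V(\hat{Y}_w) = N^2 \gamma \Big[ \frac{1}{W'} + \frac{(\bar{x}_N - \bar{x}_w)^2}{\sum_{i=1}^{k} w_i' (x_i - \bar{x}_w)^2} \Big] \tag{71}$$

which now depends only on the known numbers $w_i'$ and the constant $\gamma$.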


5.6 Estimation of the Variance of Weighted Regression Estimate 


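We shall consider this problem for the case where the variance of $y$ for a given $x$ is proportional to $x$. Since the variance of the weighted regression estimate for this case is given by (71), the problem reduces to that of the estimation of $\gamma$.

As in Section 5.3, we shall start with the quantity $Q$ defined by

$$Q = \sum_{i=1}^{k} \frac{w_i'}{n_i} \sum_{j=1}^{n_i} \big[ y_{ij} - \bar{y}_w - \hat{\beta} (x_i - \bar{x}_w) \big]^2 \tag{72}$$

and proceed to calculate its expectation for a given set of x's. A straightforward method of doing this is to express (72) in terms of $\varepsilon_{ij}$ and $x_i$ and then take expectations. Now $\varepsilon_{ij}$ is defined by (1), namely,

$$y_{ij} = \alpha + \beta x_i + \varepsilon_{ij} \tag{73}$$

where

$$E(\varepsilon_{ij} \mid i) = 0$$

and

$$E(\varepsilon_{ij}^2 \mid i) = V(y \mid x_i) = \gamma x_i$$

From (1), we have

$$\bar{y}_{n_i} = \alpha + \beta x_i + \bar{\varepsilon}_{n_i} \tag{74}$$

where

$$E(\bar{\varepsilon}_{n_i}^2) = \frac{S_i^2}{n_i} \cdot \frac{N_i - n_i}{N_i} = \frac{\gamma x_i}{n_i} \cdot \frac{N_i - n_i}{N_i - 1} = \frac{\gamma}{w_i'} \tag{75}$$

Also, from (74),

$$\sum_{i=1}^{k} w_i' \bar{y}_{n_i} = \alpha \sum_{i=1}^{k} w_i' + \beta \sum_{i=1}^{k} w_i' x_i + \sum_{i=1}^{k} w_i' \bar{\varepsilon}_{n_i}$$

whence, dividing by $W'$, we have

$$\bar{y}_w = \alpha + \beta \bar{x}_w + \bar{\varepsilon}_w \tag{76}$$

where

$$\bar{\varepsilon}_w = \frac{1}{W'} \sum_{i=1}^{k} w_i' \bar{\varepsilon}_{n_i} \tag{77}$$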
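Further, from (65),

$$\hat{\beta} = \frac{1}{W' s_{wx}^2} \sum_{i=1}^{k} w_i' (x_i - \bar{x}_w) \bar{y}_{n_i} = \frac{1}{W' s_{wx}^2} \sum_{i=1}^{k} w_i' (x_i - \bar{x}_w)(\alpha + \beta x_i + \bar{\varepsilon}_{n_i})$$

$$= \frac{1}{W' s_{wx}^2} \Big[ \beta \sum_{i=1}^{k} w_i' (x_i - \bar{x}_w)^2 + \sum_{i=1}^{k} w_i' (x_i - \bar{x}_w) \bar{\varepsilon}_{n_i} \Big]$$

$$= \beta + \frac{1}{W' s_{wx}^2} \sum_{i=1}^{k} w_i' (x_i - \bar{x}_w) \bar{\varepsilon}_{n_i} \tag{78}$$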


On substituting for $y_{ij}$ from (1), for $\bar{y}_w$ from (76) and for $\hat{\beta}$ from (78) in (72), we obtain

$$Q = \sum_{i=1}^{k} \frac{w_i'}{n_i} \sum_{j=1}^{n_i} \Big[ (\varepsilon_{ij} - \bar{\varepsilon}_w) - (x_i - \bar{x}_w) \frac{1}{W' s_{wx}^2} \sum_{t=1}^{k} w_t' (x_t - \bar{x}_w) \bar{\varepsilon}_{n_t} \Big]^2$$

$$= \sum_{i=1}^{k} \frac{w_i'}{n_i} \sum_{j=1}^{n_i} \varepsilon_{ij}^2 - W' \bar{\varepsilon}_w^2 - \frac{1}{W' s_{wx}^2} \Big\{ \sum_{t=1}^{k} w_t' (x_t - \bar{x}_w) \bar{\varepsilon}_{n_t} \Big\}^2$$

$$= \sum_{i=1}^{k} \frac{w_i'}{n_i} \sum_{j=1}^{n_i} \varepsilon_{ij}^2 - \frac{1}{W'} \Big[ \sum_{i=1}^{k} w_i'^2 \bar{\varepsilon}_{n_i}^2 + \sum_{i \neq l} w_i' w_l' \bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l} \Big]$$

$$\quad - \frac{1}{W' s_{wx}^2} \Big[ \sum_{i=1}^{k} w_i'^2 \bar{\varepsilon}_{n_i}^2 (x_i - \bar{x}_w)^2 + \sum_{i \neq l} w_i' w_l' \bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l} (x_i - \bar{x}_w)(x_l - \bar{x}_w) \Big] \tag{79}$$
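On taking expectations and noting that

$$E(\varepsilon_{ij}^2) = \gamma x_i, \qquad E(\bar{\varepsilon}_{n_i}^2) = \frac{\gamma}{w_i'}$$

and

$$E(\bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l}) = 0$$

we have

$$E(Q) = \gamma \Big[ \sum_{i=1}^{k} \frac{n_i (N_i - 1)}{N_i - n_i} - 2 \Big] \tag{80}$$

It follows that an unbiased estimate of $\gamma$ is provided by

$$\text{Est. } \gamma = \frac{Q}{\sum_{i=1}^{k} \dfrac{n_i (N_i - 1)}{N_i - n_i} - 2} \tag{81}$$

Now, let

$$s_{wy}^2 = \frac{1}{W'} \Big[ \sum_{i=1}^{k} \frac{w_i'}{n_i} \sum_{j=1}^{n_i} y_{ij}^2 - W' \bar{y}_w^2 \Big] \tag{82}$$

and

$$r_w = \frac{\sum_{i=1}^{k} w_i' \bar{y}_{n_i} (x_i - \bar{x}_w)}{W' s_{wx} s_{wy}} \tag{83}$$

then $Q$ can be expressed as

$$Q = W' s_{wy}^2 (1 - r_w^2) \tag{84}$$

and we have

$$\text{Est. } V(\hat{Y}_w) = \frac{N^2 s_{wy}^2 (1 - r_w^2)}{\sum_{i=1}^{k} \dfrac{n_i (N_i - 1)}{N_i - n_i} - 2} \Big[ 1 + \frac{(\bar{x}_N - \bar{x}_w)^2}{s_{wx}^2} \Big] \tag{85}$$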
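When sampling is carried out with replacement and

$$V(y \mid x) = \gamma$$

then

$$Q = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \{ y_{ij} - \bar{y}_n - b (x_i - \bar{x}_n) \}^2 = n s_y^2 (1 - r^2), \qquad \text{Est. } \gamma = \frac{n s_y^2 (1 - r^2)}{n - 2}$$

where $s_y^2$, $s_x^2$ and $r$ are the sample variances and correlation coefficient calculated with divisor $n$, and we have, as previously,

$$\text{Est. } V(\hat{Y}_l) = \frac{N^2 s_y^2 (1 - r^2)}{n - 2} \Big[ 1 + \frac{(\bar{x}_n - \bar{x}_N)^2}{s_x^2} \Big] \tag{86}$$


5.7 Comparison of Weighted with Simple Regression


The sampling variance of the weighted regression estimate, to terms in $1/n$, is obtained from (66), being given by

$$V(\hat{Y}_w) = \frac{N^2}{W} \tag{87}$$

where

$$W = \sum_{i=1}^{k} w_i$$

while that of the simple regression estimate to the same degree of accuracy is from (22) given by

$$V(\hat{Y}_l) = \frac{N^2 \gamma}{n} \tag{88}$$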
However, (87) and (88) are not directly comparable. To make 
them comparable, we shall suppose that sampling is carried out 
with replacement, so that we may take 


М;— то 


, Equation (87) then becomes 


kn (89) 


Ni 
URS 
Qu — In)? 

EL 


оц = * N. = 
i 
and 
Ge) 2 
V (Y,) y » ча Ой 
іші 047 
Now, let 
gi coy s E: (02). (91) 
So that 
Е (8 (од) = 0, Е(5 (o)? = (ог) 
апа 
7 k 
т 2 To 
a of 865 
= УІ ——- 
e у (1+ 2) 
в 
= т _ 8 (0,2) {8 (о0,2))2 
provided 


/ 


ШЫР 
У | 
Taking expectations, we obtain 


в 
Е | | nn ҮЧЕ . - 
(2 5) ES + F Сага Р | (93) 


where $C^2_{\sigma^2}$ is the square of the coefficient of variation of $\sigma_i^2$.


The average value of (90) is, therefore, approximated by 


У Qa) 1 K 

E [у] = грба ы 
The result is due to Cochran (1942). It shows what is indeed obvious, that if the $\sigma_i^2$'s do not change very greatly, simple regression may be used without appreciable loss of precision. If, on the other hand, the $\sigma_i^2$'s vary very considerably, the use of the simple regression will lead to loss of efficiency.


Example 5.1

Table 5.1 summarizes the data for a simple random sample of 64 villages drawn from the total of 319 villages referred to in Example 4.2. Assuming that villages within a class are of the same size, equal to the mean value per village in that class, use the method of linear regression to estimate the livestock population and its variance under each of the following two assumptions:

I. $V(y \mid x) = \gamma x$

and, II. $V(y \mid x)$ = constant, say $\gamma$; and sampling within each class is with replacement.


TABLE 5.1

Summary of Data Relating to Agricultural Area (x) and Number of Livestock (y) in a Simple Random Sample of 64 Villages Selected from the Population in Table 4.1

Serial No.   Area of Village   n_i   Sum of y_ij   Sum of y_ij^2
of Class     (x_i)                   over j        over j
   1             63.73           2        16             256
   2            155.33           5       333           26895
   3            245.68          18      1810          281314
   4            344.40          16      1991          330113
   5            491.56          13      1815          287079
   6            767.49          10      2352          605510
 Total                          64      8317         1531167

Method I

The relevant formulæ for the regression estimate of the population total and its variance are given by (63) and (85). To evaluate these we require the values of $\bar{x}_w$, $\bar{y}_w$, $s_{wx}^2$, $s_{wy}^2$ and the estimate of $\beta$. The calculations leading to these values are given in Table 5.2. From this table we obtain

$$\bar{x}_w = \frac{\text{Total of col. 5}}{\text{Total of col. 7}} = 294.24$$

$$\bar{y}_w = \frac{\text{Total of col. 9}}{\text{Total of col. 7}} = 102.72$$

$$s_{wx}^2 = \frac{\sum_{i=1}^{k} w_i' (x_i - \bar{x}_w)^2}{W'} = \frac{\text{Total of col. 16}}{\text{Total of col. 7}} = \frac{7952}{.27297} = 29131, \qquad s_{wx} = 170.7$$

$$s_{wy}^2 = \frac{\sum_{i=1}^{k} \dfrac{w_i'}{n_i} \sum_{j=1}^{n_i} y_{ij}^2 - W' \bar{y}_w^2}{W'} = \frac{4646.4 - 2880.2}{.27297} = \frac{1766.2}{.27297} = 6470.3, \qquad s_{wy} = 80.44$$

$$r_w = \frac{\sum_{i=1}^{k} w_i' \bar{y}_{n_i} (x_i - \bar{x}_w)}{W' s_{wx} s_{wy}} = \frac{2314.6}{(13731)(.27297)} = \frac{8479.3}{13731} = .6175$$

$$\hat{\beta} = \frac{\sum_{i=1}^{k} w_i' \bar{y}_{n_i} (x_i - \bar{x}_w)}{\sum_{i=1}^{k} w_i' (x_i - \bar{x}_w)^2} = \frac{2314.55}{7952} = .2911$$

Hence on substituting in (63), we have

$$\hat{Y}_w = 319 [\bar{y}_w + \hat{\beta} (\bar{x}_N - \bar{x}_w)] = 319 [102.7 + .2911 (367.5 - 294.2)] = 319 [102.7 + .2911 (73.3)] = 319 [124.0] = 39556$$

Also, on substituting in (85), we get

$$\text{Est. } V(\hat{Y}_w) = \frac{319^2 (6470.3)(1 - .3813)}{78.3191} \Big[ 1 + \frac{(73.3)^2}{29131} \Big] = 101761 \times 51.1136 \, [1 + .1844] = 101761 \times 60.5389$$

$$\% \text{ S.E.} = \frac{100 \sqrt{60.54}}{124.0} = 6.27$$
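The arithmetic of Method I can be retraced from the class totals of Table 5.1 and the class sizes of Table 5.2. The following is a minimal sketch, not the author's computing scheme; the function and variable names are illustrative, and Q is computed in the algebraically equivalent form obtained by expanding (84).

```python
# Data of Tables 5.1 and 5.2 (class sizes N_i as read from Table 5.2).
N_i = [11, 48, 84, 60, 77, 39]
n_i = [2, 5, 18, 16, 13, 10]
x_i = [63.73, 155.33, 245.68, 344.40, 491.56, 767.49]
sum_y = [16, 333, 1810, 1991, 1815, 2352]                  # sum over j of y_ij
sum_y2 = [256, 26895, 281314, 330113, 287079, 605510]      # sum over j of y_ij^2


def weighted_regression(N_i, n_i, x_i, sum_y, sum_y2):
    """Estimate (63) and estimated variance (85) when V(y|x) = gamma * x."""
    N = sum(N_i)
    xbar_N = sum(Nc * x for Nc, x in zip(N_i, x_i)) / N
    # n_i (N_i - 1) / (N_i - n_i), i.e. w_i' x_i (col. 5 of Table 5.2)
    wx = [n * (Nc - 1) / (Nc - n) for n, Nc in zip(n_i, N_i)]
    w = [v / x for v, x in zip(wx, x_i)]                    # w_i', eq. (69)
    W = sum(w)
    ybar_i = [s / n for s, n in zip(sum_y, n_i)]
    xbar_w = sum(wx) / W
    ybar_w = sum(wi * yb for wi, yb in zip(w, ybar_i)) / W
    sxx = sum(wi * (x - xbar_w) ** 2 for wi, x in zip(w, x_i))
    sxy = sum(wi * yb * (x - xbar_w) for wi, yb, x in zip(w, ybar_i, x_i))
    beta_hat = sxy / sxx                                    # eq. (65)
    total = N * (ybar_w + beta_hat * (xbar_N - xbar_w))     # eq. (63)
    # Q = W' s_wy^2 (1 - r_w^2), eq. (84), expanded to avoid square roots
    Q = (sum(wi * s2 / n for wi, s2, n in zip(w, sum_y2, n_i))
         - W * ybar_w ** 2 - beta_hat * sxy)
    gamma_hat = Q / (sum(wx) - 2)                           # eq. (81)
    # eq. (85) divided by N^2: estimated variance of the estimated mean
    var_mean = gamma_hat * (1 / W + (xbar_N - xbar_w) ** 2 / sxx)
    return beta_hat, total, var_mean


beta_hat, total_w, var_mean_w = weighted_regression(N_i, n_i, x_i, sum_y, sum_y2)
```

Carried at full precision the total comes out slightly above the printed 39556, since the text rounds the bracketed mean to 124.0 before multiplying by 319.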


Method II

The relevant formulæ are given by (19) and (86) respectively. We have

$$b = \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} = \frac{644382}{2455455} = 0.2624$$

Hence on substituting in (19), we have

$$\hat{Y}_l = 319 [129.95 + 0.2624 (367.5 - 389.1)] = 319 (124.28) = 39645$$

Again

$$s_y^2 = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - n \bar{y}_n^2}{n} = \frac{1531167 - 1080768}{64} = 7037.5$$

and

$$s_x^2 = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2}{n} = \frac{2455454.7}{64} = 38366.48$$

$$s_{xy} = \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{n} = \frac{644382}{64} = 10068.47$$

$$s_y^2 (1 - r^2) = s_y^2 - \frac{s_{xy}^2}{s_x^2} = 7037.5 - \frac{(10068.47)^2}{38366.48} = 4395.2$$

Hence, on substituting in (86), we get

$$\text{Est. } V(\hat{Y}_l) = 319^2 \cdot \frac{4395.2}{62} \Big( 1 + \frac{(367.53 - 389.09)^2}{38366.5} \Big) = 319^2 \times 70.89 \times 1.01212 = 101761 \times 71.75$$

$$\% \text{ S.E.} = \frac{100 \sqrt{71.75}}{124.3} = 6.8$$

Compared to the value of 60.54 of the sampling variance of the estimated mean obtained by Method I, Method II is seen to give a value of 71.75, which is larger by about 20%. This must be traced to the rather large variability among the $\sigma_i^2$'s, vide Table 4.1.
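Method II can be retraced in the same way from the class totals. The sketch below is illustrative only (function and variable names are not from the text); it assumes sampling with replacement and constant residual variance, as stated for this method.

```python
# Data of Tables 5.1 and 5.2 (class sizes N_i as read from Table 5.2).
N_i = [11, 48, 84, 60, 77, 39]
n_i = [2, 5, 18, 16, 13, 10]
x_i = [63.73, 155.33, 245.68, 344.40, 491.56, 767.49]
sum_y = [16, 333, 1810, 1991, 1815, 2352]                  # sum over j of y_ij
sum_y2 = [256, 26895, 281314, 330113, 287079, 605510]      # sum over j of y_ij^2


def simple_regression(N_i, n_i, x_i, sum_y, sum_y2):
    """Estimate (19) and estimated variance (86) when V(y|x) is constant."""
    N = sum(N_i)
    n = sum(n_i)
    xbar_N = sum(Nc * x for Nc, x in zip(N_i, x_i)) / N
    xbar_n = sum(ni * x for ni, x in zip(n_i, x_i)) / n
    ybar_n = sum(sum_y) / n
    sxx = sum(ni * (x - xbar_n) ** 2 for ni, x in zip(n_i, x_i))
    sxy = sum(s * (x - xbar_n) for s, x in zip(sum_y, x_i))
    b = sxy / sxx
    total = N * (ybar_n + b * (xbar_N - xbar_n))            # eq. (19)
    Q = sum(sum_y2) - n * ybar_n ** 2 - b * sxy             # n s_y^2 (1 - r^2)
    gamma_hat = Q / (n - 2)                                 # eq. (36)
    # eq. (86) divided by N^2: estimated variance of the estimated mean
    var_mean = gamma_hat * (1 / n + (xbar_n - xbar_N) ** 2 / sxx)
    return b, total, var_mean


b, total_l, var_mean_l = simple_regression(N_i, n_i, x_i, sum_y, sum_y2)
```

Small differences from the printed figures arise because the text rounds the sample mean of y to 129.95 before squaring.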
TABLE 5.2

Computations Leading to the Values of the Regression Estimate of the Livestock Population and its Variance

Column headings, first part: (1) n_i; (2) N_i; (3) n_i(N_i - 1); (4) N_i - n_i; (5) col. 3 / col. 4; (6) x_i; (7) w_i' = col. 5 / col. 6; (8) mean of y in the class.

Serial No.
of Class    (1)    (2)    (3)    (4)      (5)       (6)       (7)      (8)
   1          2     11     20      9     2.2222     63.73    .03487     8.00
   2          5     48    235     43     5.4651    155.33    .03518    66.60
   3         18     84   1494     66    22.6364    245.68    .09214   100.56
   4         16     60    944     44    21.4545    344.40    .06230   124.44
   5         13     77    988     64    15.4375    491.56    .03141   139.62
   6         10     39    380     29    13.1034    767.49    .01707   235.20
 Total       64    319                  80.3191              .27297

Column headings, second part: (9) col. 7 x col. 8; (10) sum of y_ij^2 over j; (11) col. 10 / n_i; (12) col. 7 x col. 11; (13) x_i minus the weighted mean of x; (14) col. 9 x col. 13; (15) square of col. 13; (16) col. 7 x col. 15.

Serial No.
of Class     (9)       (10)     (11)     (12)      (13)       (14)      (15)    (16)
   1        0.2790      256      128      4.5    -230.51     -64.31    53135    1853
   2        2.3430    26895     5379    189.2    -138.91    -325.47    19296     679
   3        9.2656   281314    15629   1440.1     -48.56    -449.94     2358     217
   4        7.7526   330113    20632   1285.4      50.16     388.87     2516     157
   5        4.3855   287079    22083    693.6     197.32     865.35    38935    1223
   6        4.0149   605510    60551   1033.6     473.25    1900.05   223966    3823
 Total     28.0406                     4646.4               2314.55             7952


5.8 Comparison of Simple Regression with the Ratio and the 
Mean per Sampling Unit Estimates 


The sampling variance of the simple regression estimate of the population total to terms in $1/n$ has been shown to be

$$V(\hat{Y}_l) = \frac{N^2 \sigma_y^2 (1 - \rho^2)}{n} \tag{95}$$

The sampling variance of the ratio estimate under comparable conditions of sampling with replacement is obtained by dropping the finite multiplier from (28) in the previous Chapter, and will, therefore, be

$$V(\hat{Y}_R) = \frac{N^2}{n} (\sigma_y^2 - 2 R_N \rho \sigma_y \sigma_x + R_N^2 \sigma_x^2) \tag{96}$$

while that of the mean per sampling unit estimate is given by

$$V(N \bar{y}_n) = \frac{N^2 \sigma_y^2}{n} \tag{97}$$

Comparing first the simple regression with the mean per sampling unit estimate, we notice that the regression estimate is always more accurate than the arithmetic mean estimate. Comparing next the simple regression with the ratio estimate, we observe that the former is more accurate than the latter if

$$\sigma_y^2 - 2 R_N \rho \sigma_y \sigma_x + R_N^2 \sigma_x^2 > \sigma_y^2 (1 - \rho^2)$$

i.e., if

$$\rho^2 \sigma_y^2 - 2 R_N \rho \sigma_y \sigma_x + R_N^2 \sigma_x^2 > 0$$

i.e., if

$$(\rho \sigma_y - R_N \sigma_x)^2 > 0 \tag{98}$$

which is always true, unless

$$R_N = \rho \frac{\sigma_y}{\sigma_x} = \beta$$

Hence the regression estimate is always more accurate than the ratio estimate unless the regression of $y$ on $x$ is a straight line passing through the origin, in which case the two estimates will have equal variance.

The data for 319 villages referred to in Example 4.2 provide the material for the comparison of the two methods. Substituting for values of $\sigma_y^2$, $\rho$, $\sigma_x^2$ and $R_N$ from col. 8 of Table 4.1 in formulæ (95) and (96), we obtain

$$\frac{V(\hat{Y}_l)}{V(\hat{Y}_R)} = \frac{\sigma_y^2 (1 - \rho^2)}{\sigma_y^2 - 2 R_N \rho \sigma_y \sigma_x + R_N^2 \sigma_x^2} = \frac{4416.3}{4416.6} \cong 1$$


There is thus little to choose between the simple regression and the 
ratio methods of estimation, since the regression of the number 


of livestock on agricultural area is almost a straight line passing 
through the origin. 
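The three variances (95), (96) and (97) are easy to compare for any assumed population values. The sketch below uses hypothetical figures; the function names and the numbers are illustrative only.

```python
def var_regression(N, n, sy2, rho):
    """Eq. (95): simple regression estimate of the total, to terms in 1/n."""
    return N ** 2 * sy2 * (1 - rho ** 2) / n


def var_ratio(N, n, sy2, sx2, rho, R):
    """Eq. (96): ratio estimate of the total, sampling with replacement."""
    return N ** 2 / n * (sy2 - 2 * R * rho * (sy2 * sx2) ** 0.5 + R ** 2 * sx2)


def var_mean_per_unit(N, n, sy2):
    """Eq. (97): mean per sampling unit estimate of the total."""
    return N ** 2 * sy2 / n


# Hypothetical population values: sigma_y = 3, sigma_x = 2, rho = 0.8,
# so that beta = rho * sigma_y / sigma_x = 1.2.
N, n, sy2, sx2, rho = 319, 64, 9.0, 4.0, 0.8
v_reg = var_regression(N, n, sy2, rho)
v_ratio_through_origin = var_ratio(N, n, sy2, sx2, rho, R=1.2)   # R equals beta
v_ratio_other = var_ratio(N, n, sy2, sx2, rho, R=2.0)            # R differs from beta
v_mean = var_mean_per_unit(N, n, sy2)
```

When R equals beta the ratio and regression variances coincide; any other value of R makes the ratio estimate less precise, by the squared term in (98).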


5.9 Comparison of Simple Regression with Stratified Sampling 


The regression method of estimation achieves the same purpose as stratification by size of the sampling unit, namely, to eliminate the effect of variation in the size of the sampling unit from the standard error of the estimated character. A comparison of the two methods is, therefore, of interest.

We have seen that the sampling variance of the estimated total in stratified sampling is given by

$$\sum_{i=1}^{k} N_i^2 \, \frac{N_i - n_i}{N_i} \, \frac{S_i^2}{n_i}$$

This, however, is not the a […] the variance of th […] stratified sampling, th […] replacement. This is directly obtained from (87) of Chapter III, being given by


2 k i 
ШЕ e» 
іші m à 


In simple regression we further assume that the true regression is linear, with constant residual variance. If the $\sigma_i^2$'s, therefore, are approximately constant, say equal to $\sigma_w^2$, the expression (99) will reduce to


М0,2 =] 
This is the variance which is comparable with the average variance 
of the simple regression estimate, namely, 


мж a -( d J (101) 


Now the value of $\sigma_y^2 (1 - \rho^2)$, as it represents the residual variance about the regression straight line, can never be less than $\sigma_w^2$. It follows therefore that the stratified sample will, in general, furnish a more accurate estimate than the simple regression method. The relationship between $y$ and $x$ is also not always found to be linear in practice, in which case the efficiency of the regression estimate is further reduced. For, while stratified sampling with suitably chosen strata can take care of any type of relationship, the regression estimate can eliminate only the effects of the linear component of the relationship. Stratified sampling has an added advantage, in that the estimate for this procedure is an unbiased estimate for any type of relationship between $x$ and $y$, and for any size of sample. It would, therefore, seem that provided the population is divided into an adequate number of strata, so as to make $\sigma_w^2$ small, we may expect stratified sampling to be superior to the regression method of estimation under most practical conditions.


5.10 Double Sampling


The regression estimate of the population mean presupposes that the population mean of $x$, namely $\bar{x}_N$, is known. However, $\bar{x}_N$ is not always known, although in many cases it is possible to estimate it from a second sample of the population without appreciably adding to the cost of the survey. The procedure is known as double sampling. In this section, we shall give the form of the regression estimate in double sampling and its sampling variance.
Let, as previously, the population be divided into $k$ classes with $N_i$ units having the value $x_i$ each $(i = 1, 2, \ldots, k)$. We shall suppose that $n$ denotes a simple random sample of $N$ on which both $y$ and $x$ are observed, and that in repeated samples of $n$, $n_1, n_2, \ldots, n_k$ units are drawn from their respective classes with replacement. Further, we shall suppose that a second random sample $Q'$ is drawn from the remaining $N - n$, say $N'$, units of the population and that only $x$ is observed on this sample.


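Now if $\bar{x}_N$ were known, then the simple regression estimate of $\bar{y}_N$ would be given by (19), namely,

$$\bar{y}_l = \bar{y}_n + b (\bar{x}_N - \bar{x}_n) \tag{102}$$

where

$$b = \frac{\sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2}$$

This can be rewritten as

$$\bar{y}_l = \bar{y}_n + b \frac{N'}{N} (\bar{x}_{N'} - \bar{x}_n) \tag{103}$$

Since $Q'$ is a random sample of $N'$, $\bar{x}_{Q'}$ provides the best unbiased linear estimate of $\bar{x}_{N'}$. Hence

$$\bar{y}_{ds} = \bar{y}_n + b \frac{N'}{N} (\bar{x}_{Q'} - \bar{x}_n) \tag{104}$$

where $\bar{y}_{ds}$ denotes the estimate of $\bar{y}_N$ based on a double sample of $n$ and $Q'$.

It is easily shown that $\bar{y}_{ds}$ is an unbiased estimate of $\bar{y}_N$. We write

$$E(\bar{y}_{ds} \mid n_1, n_2, \ldots, n_k, Q') = E(\bar{y}_n) + \frac{N'}{N} E\{ b (\bar{x}_{Q'} - \bar{x}_n) \} \tag{105}$$

Now from (31) and (32), we obtain

$$E(\bar{y}_n) = \alpha + \beta \bar{x}_n \tag{106}$$

and

$$E(b) = \beta \tag{107}$$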
Also

$$b \, \bar{x}_{Q'} = \frac{\Big\{ \sum_{i=1}^{k} n_i \bar{y}_{n_i} (x_i - \bar{x}_n) \Big\} \Big\{ \sum_{j=1}^{k} Q_j' x_j \Big\}}{Q' \sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2}$$

where $Q_j'$ represents the number of units in the $j$-th class in the sample $Q'$. Taking expectations for fixed $n_1, n_2, \ldots, n_k$ and substituting $E(\bar{y}_{n_i}) = \alpha + \beta x_i$, we get

$$E(b \, \bar{x}_{Q'}) = \frac{\sum_{i=1}^{k} n_i (\alpha + \beta x_i)(x_i - \bar{x}_n)}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \, \bar{x}_{N'} = \beta \bar{x}_{N'} \tag{108}$$

On substituting from (106), (107) and (108) in (105), we get

$$E(\bar{y}_{ds} \mid n_1, n_2, \ldots, n_k, Q') = \alpha + \beta \bar{x}_n + \frac{N'}{N} \beta (\bar{x}_{N'} - \bar{x}_n)$$

$$= \alpha + \beta \bar{x}_n + \beta (\bar{x}_N - \bar{x}_n)$$

$$= \alpha + \beta \bar{x}_N$$

$$= \bar{y}_N \tag{109}$$

thus showing that $\bar{y}_{ds}$ is an unbiased estimate of $\bar{y}_N$.
To obtain the variance, we write

$$V(\bar{y}_{ds}) = E \Big[ \bar{y}_n + b \frac{N'}{N} (\bar{x}_{Q'} - \bar{x}_n) \Big]^2 - \bar{y}_N^2$$

$$= E(\bar{y}_n^2) + \Big( \frac{N'}{N} \Big)^2 E\{ b^2 (\bar{x}_{Q'} - \bar{x}_n)^2 \} + 2 \frac{N'}{N} E\{ \bar{y}_n b (\bar{x}_{Q'} - \bar{x}_n) \} - \bar{y}_N^2 \tag{110}$$

Now

$$E(\bar{y}_n^2) = V(\bar{y}_n) + \{ E(\bar{y}_n) \}^2 = \frac{\gamma}{n} + (\alpha + \beta \bar{x}_n)^2 \tag{111}$$

Next

$$E(b^2) = \beta^2 + E(b - \beta)^2$$

On substituting from (32) and taking expectations, we get

$$E(b^2) = \beta^2 + \frac{\gamma}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \tag{112}$$

and, since the cross-product terms $\bar{\varepsilon}_{n_i} \bar{\varepsilon}_{n_l}$ have zero expectation and $\sum n_i (x_i - \bar{x}_n) = 0$,

$$E(\bar{y}_n b) = (\alpha + \beta \bar{x}_n) \beta = E(\bar{y}_n) E(b) \tag{113}$$

Substituting from (111), (112) and (113) in (110), we therefore have

$$V(\bar{y}_{ds}) = \frac{\gamma}{n} + (\alpha + \beta \bar{x}_n)^2 + \Big( \frac{N'}{N} \Big)^2 \Big[ \beta^2 + \frac{\gamma}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] E(\bar{x}_{Q'} - \bar{x}_n)^2$$

$$\quad + 2 (\alpha + \beta \bar{x}_n) \beta \frac{N'}{N} E(\bar{x}_{Q'} - \bar{x}_n) - \bar{y}_N^2 \tag{114}$$
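Lastly,

$$E(\bar{x}_{Q'} - \bar{x}_n)^2 = E(\bar{x}_{Q'} - \bar{x}_{N'} + \bar{x}_{N'} - \bar{x}_n)^2 = E(\bar{x}_{Q'} - \bar{x}_{N'})^2 + (\bar{x}_{N'} - \bar{x}_n)^2$$

$$= \Big( \frac{1}{Q'} - \frac{1}{N'} \Big) S_{xN'}^2 + (\bar{x}_{N'} - \bar{x}_n)^2 \tag{115}$$

where

$$S_{xN'}^2 = \frac{1}{N' - 1} \sum_{i=1}^{N'} (x_i - \bar{x}_{N'})^2$$

Also

$$\frac{N'}{N} E(\bar{x}_{Q'} - \bar{x}_n) = \frac{N'}{N} (\bar{x}_{N'} - \bar{x}_n) = \bar{x}_N - \bar{x}_n \tag{116}$$

On substituting from (25), (115) and (116) in (114) and simplifying, we get

$$V(\bar{y}_{ds}) = \gamma \Big[ \frac{1}{n} + \frac{(\bar{x}_N - \bar{x}_n)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] + \Big( \frac{N'}{N} \Big)^2 \Big( \frac{1}{Q'} - \frac{1}{N'} \Big) S_{xN'}^2 \Big[ \beta^2 + \frac{\gamma}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] \tag{117}$$

For $N$ and $N'$ large, (117) reduces to

$$V(\bar{y}_{ds}) = \gamma \Big[ \frac{1}{n} + \frac{(\bar{x}_N - \bar{x}_n)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] + \frac{S_x^2}{Q'} \Big[ \beta^2 + \frac{\gamma}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] \tag{118}$$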
It will be seen that the variance is composed of two terms. The first term is clearly the variance of the simple regression estimate when $\bar{x}_N$ is known, while the second represents the increase in the variance of the estimate when $\bar{x}_N$ is determined from a sample. When $Q' = N'$ or, in other words, when $\bar{x}_N$ is known, we are left with the variance of the simple regression estimate.
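The decomposition into two terms can be made concrete with a small helper. This is a sketch under the large-population form of the variance, namely the simple regression term plus (S_x^2 / Q') times (beta^2 + gamma / sum of n_i (x_i - sample mean of x)^2); the function name, argument names and numbers are illustrative assumptions.

```python
def double_sampling_variance(gamma, beta, n_i, x_i, xbar_N, S2_x, Q):
    """Return (term with the mean of x known, extra term due to estimating it).

    Large-population form of the conditional variance; all names hypothetical.
    """
    n = sum(n_i)
    xbar_n = sum(ni * x for ni, x in zip(n_i, x_i)) / n
    ssx = sum(ni * (x - xbar_n) ** 2 for ni, x in zip(n_i, x_i))
    known = gamma * (1.0 / n + (xbar_N - xbar_n) ** 2 / ssx)
    extra = S2_x / Q * (beta ** 2 + gamma / ssx)
    return known, extra


# Hypothetical values: doubling the second sample halves the extra term.
args = dict(gamma=2.0, beta=1.5, n_i=[4, 6, 5, 5], x_i=[1.0, 2.0, 4.0, 7.0],
            xbar_N=3.2, S2_x=5.0)
known_100, extra_100 = double_sampling_variance(Q=100, **args)
known_200, extra_200 = double_sampling_variance(Q=200, **args)
```

The first component does not depend on the size of the second sample, so the gain from enlarging it is confined to the second component.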


The variance of the estimate $\bar{y}_{ds}$ is seen to depend upon the x's in the sample. Its average value is easily deduced following the method adopted in Section 5.4. Thus, for large $N$ and ignoring terms of order higher than $1/n^2$, we have from (42),


"am Gs —3 |, | (119) 
Ўто)" 
Also 
5,2 7 
Е E а id — | сете (120) 
L Xn, (x; -— 9л 


On substituting from (119) and (120) in (118), we get 


1 


26,2 
Е (И 849) 2 7 Е m 2: + öl ge (121) 


МУ. 
Q' 

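For large $N$, an alternative and more efficient estimate than the one given by (104) can be formed. This is defined by

$$\bar{y}'_{ds} = \bar{y}_n + b (\bar{x}_{Q'+n} - \bar{x}_n) \tag{122}$$

It is easily shown that this is an unbiased estimate of the population mean, for,

$$E(\bar{y}'_{ds}) = (\alpha + \beta \bar{x}_n) + \beta E(\bar{x}_{Q'+n} - \bar{x}_n)$$

$$= \alpha + \beta \bar{x}_n + \beta (\bar{x}_N - \bar{x}_n)$$

$$= \alpha + \beta \bar{x}_N$$

$$= \bar{y}_N \tag{123}$$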
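To obtain its variance, we write

$$V(\bar{y}'_{ds}) = E\{ \bar{y}_n + b (\bar{x}_{Q'+n} - \bar{x}_n) - \alpha - \beta \bar{x}_N \}^2$$

$$= E\{ \bar{y}_n - \alpha - \beta \bar{x}_n + (b - \beta)(\bar{x}_{Q'+n} - \bar{x}_n) + \beta (\bar{x}_{Q'+n} - \bar{x}_N) \}^2$$

$$= E(\bar{y}_n - \alpha - \beta \bar{x}_n)^2 + E\{ (b - \beta)^2 (\bar{x}_{Q'+n} - \bar{x}_n)^2 \} + \beta^2 E(\bar{x}_{Q'+n} - \bar{x}_N)^2$$

$$= \frac{\gamma}{n} + \gamma E \Big[ \frac{(\bar{x}_{Q'+n} - \bar{x}_n)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] + \beta^2 E(\bar{x}_{Q'+n} - \bar{x}_N)^2 \tag{124}$$

If we assume a normal distribution for the x's, we have, from (43),

$$E \Big[ \frac{(\bar{x}_{Q'+n} - \bar{x}_n)^2}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2} \Big] = \Big( \frac{1}{n} - \frac{1}{Q' + n} \Big) \frac{1}{n - 3} \tag{125}$$

By a method similar to that adopted in Section 5.4, it can, however, be shown that (125) will be true in general even without assuming normality, provided terms of higher order in $1/n$ are ignored. Hence

$$V(\bar{y}'_{ds}) = \frac{S_y^2 (1 - \rho^2)}{n} \Big[ 1 + \frac{Q'}{Q' + n} \cdot \frac{1}{n - 3} \Big] + \frac{\rho^2 S_y^2}{Q' + n} \tag{126}$$

which is seen to be smaller than (121). The expression was first given by Cochran (see Jessen, R. J., 1942).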

The estimation of the variance of either estimate presents no difficulty. Thus, to estimate the variance of $\bar{y}_{ds}$ we note that

$$s_{Q'}^2 = \frac{\sum_{i=1}^{Q'} (x_i - \bar{x}_{Q'})^2}{Q' - 1}$$

is an unbiased estimate of $S_{xN'}^2$; from (112), $b^2$ is an unbiased estimate of

$$\beta^2 + \frac{\gamma}{\sum_{i=1}^{k} n_i (x_i - \bar{x}_n)^2}$$

from (115),

$$\text{Est. } (\bar{x}_N - \bar{x}_n)^2 = \Big( \frac{N'}{N} \Big)^2 \Big[ (\bar{x}_{Q'} - \bar{x}_n)^2 - \Big( \frac{1}{Q'} - \frac{1}{N'} \Big) s_{Q'}^2 \Big]$$

and the estimate of $\gamma$ is given by (36) in Section 5.3. On making the substitutions in (117), we get the desired estimate.


Similarly, we have from (126), 


m S nb 
Est. (Уз) = түң —2) |; 23 О + п =) 


+ TT 1-95) (127) 


5.11* Successive Sampling 

In using the method of regression we have so far assumed that ancillary information is available for all the units in the sample for which the character under study is observed. This is not, however, always the case and the formulæ given above need to be extended to make use of the information contained in additional units recording only the character under study. This is the case, for instance, when the same variate is observed on two successive occasions in time, there being in the units observed some which are common to both occasions and some which are exclusive to each occasion. The observations on the earlier occasion are used as ancillary information to improve the estimate of the population value on the second occasion. We shall assume the population to be large.

Let the character be observed on $Q' + n_1$ units on the first occasion and $n_1 + n_2$ units on the second occasion, of which $n_1$ are common to the first occasion. We may, as in (122), form a regression estimate of the mean on the second occasion based on the $n_1$ units, viz.,

$$\bar{z} = \bar{y}_{n_1} + b (\bar{x}_{Q'+n_1} - \bar{x}_{n_1}) \tag{128}$$
where $y$ denotes observations on the second occasion and $x$ those on the first occasion. Another independent estimate of the population mean on the second occasion is available on the basis of $n_2$ units observed on the second occasion only, viz., $\bar{y}_{n_2}$. Any unbiased linear combination of the two estimates can be written as:

$$\bar{y} = (1 - \phi) \bar{z} + \phi \bar{y}_{n_2} \tag{129}$$

where $\phi$ is any positive number less than 1. To obtain the best overall estimate, we may choose $\phi$ so as to minimize the variance of (129), i.e., minimize the expression

$$(1 - \phi)^2 V(\bar{z}) + \phi^2 V(\bar{y}_{n_2})$$

We have easily

$$\phi V(\bar{y}_{n_2}) = (1 - \phi) V(\bar{z}) \tag{130}$$

Thus

$$\phi = \frac{V(\bar{z})}{V(\bar{z}) + V(\bar{y}_{n_2})} \tag{131}$$

where, from (126),

$$V(\bar{z}) = \frac{S_y^2 (1 - \rho^2)}{n_1} \Big[ 1 + \frac{Q'}{Q' + n_1} \cdot \frac{1}{n_1 - 3} \Big] + \frac{\rho^2 S_y^2}{Q' + n_1}$$

and

$$V(\bar{y}_{n_2}) = \frac{S_y^2}{n_2}$$
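The choice of the weight in (131) is the usual inverse-variance combination of two independent unbiased estimates. A minimal sketch follows; the function name and the numerical variances are hypothetical.

```python
def combine_independent_estimates(v_z, v_y):
    """Optimum weight phi of eq. (131) and the variance of the combination (129).

    v_z: variance of the regression estimate based on the common units.
    v_y: variance of the mean of the units observed on the second occasion only.
    """
    phi = v_z / (v_z + v_y)
    variance = (1 - phi) ** 2 * v_z + phi ** 2 * v_y
    return phi, variance


phi, v_combined = combine_independent_estimates(v_z=2.0, v_y=6.0)
# With these hypothetical variances phi = 0.25 and the combined variance is
# v_z * v_y / (v_z + v_y) = 1.5, smaller than either component variance.
```

The weight attached to each estimate is proportional to the variance of the other, so the more precise estimate always receives the larger share.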


2 When information is available for more than one previous 


nt occasion can be worked out 
on precisely the same argu- 
ons. Denote by ), the best 
mean on the /-th Occasion, 
casion. We write 

Ж = 0 — dy) 2, + V ys, (132) 
where у 


„ = 77 F br, за (а # 


ки 


n'h 
Jn, = mean on the k-th occasion 


of n, 
the (h — 1)-th occasion 


units common to 


ль 
Zv, = mean of the same n’, units on the preceding occasion 


Jar, == mean on the /r-th occasion of n", units not common 
with the (Л — 1)-th occasion 


ba, ъл = regression coefficient of values of the h-th occasion on 
values of the (Л — 1)-th occasion based on the n’, 
common units, 


LXyG-E 
X ( ind 
As before it will be seen that $\bar{z}_h$ is the regression estimate of the population mean $\mu_h$ on the $h$-th occasion, using observations on the $(h - 1)$-th occasion as ancillary information. Another independent estimate is provided by $\bar{y}_{n''_h}$, and we choose $\phi_h$ so as to minimize the variance of the above combination. For presentation of the exact theory it will be assumed that units in the sample are common only between consecutive occasions, but the results will, in practice, be still true if it can be assumed that observations more than one occasion apart do not provide any additional information. Minimizing the variance with respect to $\phi_h$, we have


VV Gun) = Q0 А) VG) (152) 
Now, using (124) and putting у = Sp? (1 — p'n,n-i), We write 


А wh 


Vi) e X ok Gua SY | gus LE Gua ња)? (134) 
A (к=) 


Assuming independence of 3 (x — Xy, and (ўпа) which 
Will necessarily be true in case of normality, we have 


5,2 (1— p, һ-а) 
n 


VQ) = 


Sen.) кгз, а бај + Ваља Иа) 099 
Sa (i, — 3) E (б, 1—1) h,h—1 һ 
But
Е (yy — Ка) 
= Va) + Sa = 2 Cov (у X, Хи) 
Sha Sia 136 
ra tra is (136) 
since 


2 
Соу ry Bey) = Y. = tha pe (137) 
The validity of (137) follows from the well-known result that the 
correlation between any unbiased estimate and an efficient estimate 
tends to the square root of the ratio of their variances. The 
approximation involved will affect only terms of order (пъ) in 
the expression for V (Zp). Тһе result is, however, exact if units 


are common only between two consecutive occasions, for, we 
have in that case 
Cov (Fra Хи 
= Соу [((1 — Jua) 2-1 + fhr- Dayi h Хе al 
= Соу (Fy, Хи, since Cov (2, |, Xy) = 0 
= tha E (ur ^y — Pra) (а — Ји ey Ди ал) 
= thal Fyn ^i) 
since 
Е (ў, 7) Gu, — Trikes) О 
Thus 
Соу (7, A S*,, 
V r-i х, = Pha п" (138 ) 
h-1 


Hence we have, from (133), (135), (136) and (137), 


2 
h >, = 0-49 [9907-9 | Ба gia) 
UE n, 


ny c 


x iz М = а п" ay —} + p*, ћу 1-1 912 2a] (139) 


m! 
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Writing 


” ag m 1—p5, һ-1 (140) 
P “n, пл = Руља i, = = 5 


equation (139) can be expressed as 


(1) [7 [S NS болан e. ES (41) 
n'y n'ya n^, 
which provides the recurrence formula for the evaluation of $\varphi_h$
(remembering that $\varphi_1 = 1$) and hence of the estimate (132). In
practice, the correlation coefficients $\rho_{h,h-1}$ may have to be evaluated
from the sample. The recurrence formula (141) is due to Narain
(1953). The general theory has also been investigated by Yates
(1949), Patterson (1950) and Tikkiwal (1951), but the regression
coefficients $\beta_{h,h-1}$ have been assumed known in their approach.


Finally, we shall consider the question of replacement. What
fraction of the sample should be replaced on each occasion in
order that the estimate on the current occasion may have the
maximum precision? We shall first consider the case of two
occasions and assume that: (a) the total size of sample $n' + n''$
is fixed on each occasion, say equal to $M$, and (b) $n'$ is sufficiently
large so as to neglect terms in $1/n'^2$ in the expression for $V(\bar{z})$.
Substituting for $\varphi$ from (131) and for $V(\bar{z})$ and $V(\bar{y}'')$ in the
expression for the variance of $\bar{y}$, we obtain


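$$
\begin{aligned}
V(\bar{y}) &= (1-\varphi)^2\,V(\bar{z}) + \varphi^2\,V(\bar{y}'') \\
&= \left\{\frac{V(\bar{y}'')}{V(\bar{z})+V(\bar{y}'')}\right\}^2 V(\bar{z}) + \left\{\frac{V(\bar{z})}{V(\bar{z})+V(\bar{y}'')}\right\}^2 V(\bar{y}'') \\
&= \frac{V(\bar{z})\,V(\bar{y}'')}{V(\bar{z})+V(\bar{y}'')} \\
&= \frac{S^2}{M}\cdot\frac{1-\rho^2\,n''/M}{1-\rho^2\,n''^2/M^2}
\end{aligned}
\tag{142}
$$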




Clearly, the optimum value of the fraction to be replaced is
obtained by minimizing $V(\bar{y})$ with respect to $n''/M$. Differentiating
with respect to $n''/M$ and equating to zero and noting that
$(n' + n'')/M = 1$, we obtain


$$\frac{n''}{M} = \frac{1}{1+\sqrt{1-\rho^2}} \tag{143}$$


It will be noticed that the fraction to be replaced depends upon
the value of $\rho$. The larger the value of $\rho$, the larger is clearly
the fraction to be replaced. $n''/M$ attains the minimum value of
1/2 when $\rho = 0$, showing that the fraction to be replaced should
always exceed 1/2 provided, of course, cost and practical considerations
warrant such replacement. For moderately high values of
$\rho$ like .5 to .7, the optimum value of the fraction to be replaced
works out to about 3/5 of the size of the total sample.
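
The optimum in (143) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the variance is taken in the assumed large-sample two-occasion form, proportional to $(1-\rho^2 u)/(1-\rho^2 u^2)$ with $u = n''/M$, which presupposes the same $S^2$ on both occasions.

```python
import math


def optimum_replacement_fraction(rho):
    """Fraction of the sample to replace on the second occasion, as in (143)."""
    return 1.0 / (1.0 + math.sqrt(1.0 - rho ** 2))


def relative_variance(u, rho):
    """V(y-bar) in units of S^2/M when a fraction u of the sample is replaced.

    Assumed large-sample form: (1 - rho^2 u) / (1 - rho^2 u^2).
    """
    return (1.0 - rho ** 2 * u) / (1.0 - rho ** 2 * u ** 2)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.7, 0.9):
        u = optimum_replacement_fraction(rho)
        print(f"rho={rho:.1f}  replace={u:.3f}  "
              f"relative variance={relative_variance(u, rho):.3f}")
```

Evaluating `relative_variance` slightly to either side of the returned fraction should give a value no smaller than at the optimum, which is a quick way to confirm the stationary point is a minimum.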


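Now, in general, on the $h$-th occasion, the variance of $\bar{y}_h$ will
be seen to be given by


$$
\begin{aligned}
V(\bar{y}_h) &= (1-\varphi_h)^2\,V(\bar{z}_h) + \varphi_h^2\,V(\bar{y}''_h) \\
&= \varphi_h(1-\varphi_h)\,V(\bar{y}''_h) + \varphi_h^2\,V(\bar{y}''_h)
\end{aligned}
$$


by virtue of (133), i.e.,


$$V(\bar{y}_h) = \varphi_h\,\frac{S_h^2}{n''_h} \tag{144}$$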


Let $h$ be sufficiently large so as to justify the use of the limiting
value of $\varphi_h$ obtained by putting $\varphi_h = \varphi_{h-1}$ in (141), and further
let $n'_h = n'$ and $n''_h = n''$ for all $h$. Differentiating (144) with
respect to $n''/M$, where $M = n' + n''$, and using the limiting value
of $\varphi$, viz.,
bul n HUNE NE D 
R aan + a/a p nba пр | 
2 E p 
we obtain 


$$\frac{n''}{M} = \frac{1}{2} \tag{145}$$

thus showing that the replacement fraction to be used after a
sufficiently large number of occasions is 1/2 (Tikkiwal, 1951).


CHAPTER VI
CHOICE OF SAMPLING UNIT 


A. EQUAL CLUSTERS
6a.1 Cluster Sampling 


A sampling procedure, as pointed out in Section 1.4, presupposes 
the division of the population into a finite number of distinct and 
identifiable units called the sampling units. Thus a population of 
fields under wheat in a given region might be regarded as 
composed of fields or groups of fields on the same holdings, 
villages, or other suitable segments. A human population might 
similarly be regarded as composed of individual persons, families, 
or groups of persons residing in houses and villages. The smallest 
units into which the population can be divided are called the 
elements of the population, and groups of elements the clusters. 
When the sampling unit is a cluster, the procedure of sampling
is called cluster sampling. When the entire area containing the
population under study is subdivided into smaller areas and each
element in the population is associated with one and only one
such small area, the procedure is alternatively called area sampling.


For many types of population a list of elements is not available
and the use of an element as the sampling unit is therefore not 
feasible. The method of cluster or area sampling is available in 
such cases. Thus, in a city a list of all the houses is readily 
available, but that of persons is rarely so. Again, lists of fields 
are not available, but those of villages are. Cluster sampling is, 
therefore, widely practised in sample surveys. 


The size of the cluster to be employed in sample surveys
therefore requires consideration. In general, the smaller the cluster,
the more accurate will usually be the estimate of the population
character for a given number of elements in the sample. Thus,
a sample of holdings independently and randomly selected is likely
to be scattered over the entire area under the crop, and thereby
provides a better cross-section of the population than an equivalent
sample, i.e., a sample of the same number of holdings,
clustered together in a few villages. On the other hand, it will
cost more to survey a widely scattered sample of holdings than to 
survey an equivalent sample of clusters of holdings, since the 
additional cost of surveying a neighbouring holding is small as 
compared to the cost of locating a second independent holding 
and surveying it. The optimum cluster is one which gives an 
estimate of the character under study with the smallest standard 
error for a given proportion of the population sampled, or more 
generally, for a given cost. In this chapter, we will give the 
relevant theory which can provide guidance in the choice of 
a sampling unit in a sample survey. 


6a.2 Efficiency of Cluster Sampling 


We shall first consider the case of equal clusters and suppose 
that the population is composed of N clusters of M elements 
each, and that a sample of $n$ clusters is drawn from it by the
method of simple random sampling. 


Let


$y_{ij}$ denote the value of the character for the $j$-th element
$(j = 1, 2, \ldots, M)$ in the $i$-th cluster $(i = 1, 2, \ldots, N)$;


$\bar{y}_{i\cdot}$ the mean per element of the $i$-th cluster, given by

$$\bar{y}_{i\cdot} = \frac{1}{M}\sum_{j=1}^{M} y_{ij} \tag{1}$$


$\bar{y}_{N\cdot}$ the mean of cluster means in the population, defined by

$$\bar{y}_{N\cdot} = \frac{1}{N}\sum_{i=1}^{N} \bar{y}_{i\cdot} \tag{2}$$


$\bar{y}_{\cdot\cdot}$ the mean per element in the population, defined by

$$\bar{y}_{\cdot\cdot} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \tag{3}$$


and


$\bar{y}_{n\cdot}$ the mean of cluster means in a simple random
sample of $n$ clusters, given by

$$\bar{y}_{n\cdot} = \frac{1}{n}\sum_{i=1}^{n} \bar{y}_{i\cdot} \tag{4}$$

= the mean per element in the sample.


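Clearly, $\bar{y}_{n\cdot}$ is an unbiased estimate of $\bar{y}_{\cdot\cdot}$, and its variance is


$$V(\bar{y}_{n\cdot}) = \frac{N-n}{Nn}\,S_b^2 \tag{5}$$


where


$$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2 \tag{6}$$


the mean square between cluster means in the population.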


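If an equivalent sample of $nM$ elements were independently
selected from the population, the variance of the mean per element
would be


$$V(\bar{y}_{nM}) = \frac{NM-nM}{NM}\cdot\frac{S^2}{nM} \tag{7}$$

where

$$S^2 = \frac{1}{NM-1}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{\cdot\cdot})^2 \tag{8}$$


Now the relative efficiency of the two sampling units is given by
the inverse ratio of their variances. It follows that the efficiency
of a cluster as the sampling unit compared with that of an element
is given by

$$E = \frac{V(\bar{y}_{nM})}{V(\bar{y}_{n\cdot})} \tag{9}$$

$$= \frac{\dfrac{NM-nM}{NM}\cdot\dfrac{S^2}{nM}}{\dfrac{N-n}{Nn}\,S_b^2} = \frac{S^2}{MS_b^2} \tag{10}$$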

If we set up an analysis of variance for elements in the population,
as shown in Table 6.1, this efficiency will be seen to be
equal to the ratio of the overall mean square between elements
to that between clusters in the population.


TABLE 6.1

Analysis of Variance

Source of Variation    Degrees of Freedom    Mean Square
Between clusters       $N-1$                 $\dfrac{M}{N-1}\sum_{i=1}^{N}(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2 = MS_b^2$
Within clusters        $N(M-1)$              $\dfrac{1}{N(M-1)}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{i\cdot})^2 = S_w^2$
Total population       $NM-1$                $\dfrac{1}{NM-1}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{\cdot\cdot})^2 = S^2$
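
The layout of Table 6.1 can be sketched in a few lines of code. This is a minimal illustration, not the author's procedure: the function names and the toy data in the usage note are hypothetical, and the whole population of equal clusters is assumed to be available.

```python
def cluster_anova(clusters):
    """Mean squares for N clusters of M elements each (layout of Table 6.1).

    Returns (M * S_b^2, S_w^2, S^2): the between-cluster, within-cluster
    and total mean squares on an element basis.
    """
    N = len(clusters)
    M = len(clusters[0])
    if any(len(c) != M for c in clusters):
        raise ValueError("all clusters must have the same size")
    cluster_means = [sum(c) / M for c in clusters]
    grand_mean = sum(cluster_means) / N
    between_ss = M * sum((m - grand_mean) ** 2 for m in cluster_means)
    within_ss = sum((y - m) ** 2
                    for c, m in zip(clusters, cluster_means) for y in c)
    total_ss = sum((y - grand_mean) ** 2 for c in clusters for y in c)
    return (between_ss / (N - 1),
            within_ss / (N * (M - 1)),
            total_ss / (N * M - 1))


def relative_efficiency(clusters):
    """Efficiency of the cluster relative to the element: S^2 / (M * S_b^2)."""
    ms_between, _, ms_total = cluster_anova(clusters)
    return ms_total / ms_between
```

For example, `relative_efficiency([[1, 2, 3], [2, 4, 6], [5, 5, 8], [9, 7, 8]])` uses four made-up clusters of three elements; because the cluster means differ widely while the values within each cluster are close together, the cluster comes out much less efficient than the element.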


Example 6.1 


To show how efficiency changes with the size of a cluster, we
give a numerical example from data relating to the use of
clusters of different sizes in estimating the area under wheat.
Table 6.2 gives values of the mean square between survey
numbers in a village ($S^2$) and the mean square between clusters
($MS_b^2$) pooled over 11 villages in the Meerut District (India).
The clusters were formed by grouping consecutive survey numbers
in a village. The character studied was the area under wheat.
The mean squares are given separately for clusters of size 2, 4,
8 and 16 survey numbers. The last row of the table gives the
values of the efficiency obtained by dividing the mean square
between survey numbers by that between clusters within villages.


TABLE 6.2
Efficiency of Clusters of Size 2, 4, 8 and 16 Survey Numbers

                                                 Size of Cluster
Mean Square (Acres)²                          2        4        8       16
Between clusters within villages          138.6    180.7    245.1    333.9
Between survey numbers within villages    108.3    108.3    108.3    108.3
Efficiency                                 0.78     0.60     0.44     0.32


It will be seen that the efficiency decreases rather rapidly with the 
increase in the size of the cluster, clusters of 2 being only about four- 
fifths as efficient as individual survey numbers, those of 4 about three- 
fifths, while those of 16 are only one-third as efficient as individual 
survey numbers. In other words, the sample of clusters of size 16 
will have to be three times as large as the sample of individual 
survey numbers in order to give an estimate of equal precision. 
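
The efficiencies in Table 6.2 follow directly from the two rows of mean squares. The short sketch below reproduces them, together with the factor by which the sample of clusters must be enlarged to match the precision of individual survey numbers; the variable and function names are illustrative, not taken from the text.

```python
# Mean squares in (acres)^2, as given in Table 6.2.
BETWEEN_CLUSTERS = {2: 138.6, 4: 180.7, 8: 245.1, 16: 333.9}   # M * S_b^2
BETWEEN_SURVEY_NUMBERS = 108.3                                  # S^2


def efficiencies():
    """Efficiency of each cluster size relative to the single survey number."""
    return {M: BETWEEN_SURVEY_NUMBERS / ms
            for M, ms in BETWEEN_CLUSTERS.items()}


if __name__ == "__main__":
    for M, e in efficiencies().items():
        print(f"M={M:2d}  efficiency={e:.2f}  "
              f"sample size multiplier={1 / e:.2f}")
```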
If clusters are random samples of M from a population of NM 
elements, and consequently composed of elements which are not 
more alike than those of other clusters, then the mean squares 
between and within clusters will behave as random variables and 
their expected values will each be of the same order. 
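For,


$$E(\text{Mean square between clusters}) = E\left\{\frac{M}{N-1}\sum_{i=1}^{N}(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2\right\} = \frac{M}{N-1}\left\{\sum_{i=1}^{N}E(\bar{y}_{i\cdot}^2) - N\bar{y}_{N\cdot}^2\right\}$$

Now substituting from (35) in Section 2a.5, we have


$$E(\text{Mean square between clusters}) = S^2 \tag{11}$$


Similarly


$$E(\text{Mean square within clusters}) = \frac{1}{N(M-1)}\,E\left\{\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}^2 - M\sum_{i=1}^{N}\bar{y}_{i\cdot}^2\right\}$$

$$= \frac{1}{N(M-1)}\left\{NM\left(\bar{y}_{\cdot\cdot}^2 + \frac{NM-1}{NM}\,S^2\right) - MN\left(\bar{y}_{\cdot\cdot}^2 + \frac{NM-M}{NM}\cdot\frac{S^2}{M}\right)\right\}$$

$$= S^2 \tag{12}$$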
It follows that if clusters are random samples of the population, 
they will, on the average, be as efficient as the individual elements 
themselves. 


6a.3 Efficiency of Cluster Sampling in Terms of Intra-Class Correlation


In practice, a cluster cannot be regarded as comprised of a
random sample of elements of the population. Usually, elements
of the same cluster will resemble each other more than those
belonging to different clusters. Thus, the variation in yield
between different portions of a field will tend to be less than that
between different fields. Consequently, the variance of an estimate
based on cluster sampling will ordinarily exceed that based on an
equivalent sample of elements selected independently. The manner
in which the variance of the estimate increases with the size of
the cluster can best be elucidated with the help of the concept
of intra-class correlation between elements of a cluster.


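Let $\rho$ denote the intra-class correlation defined by


$$\rho = \frac{E\{(y_{ij}-\bar{y}_{N\cdot})(y_{ik}-\bar{y}_{N\cdot})\}}{E(y_{ij}-\bar{y}_{N\cdot})^2} \qquad (j \neq k) \tag{13}$$


The numerator in (13) can be written as

$$
\begin{aligned}
& E\{(y_{ij}-\bar{y}_{i\cdot}+\bar{y}_{i\cdot}-\bar{y}_{N\cdot})(y_{ik}-\bar{y}_{i\cdot}+\bar{y}_{i\cdot}-\bar{y}_{N\cdot})\} \\
&\quad = E\{(y_{ij}-\bar{y}_{i\cdot})(y_{ik}-\bar{y}_{i\cdot}) + (y_{ij}-\bar{y}_{i\cdot})(\bar{y}_{i\cdot}-\bar{y}_{N\cdot}) \\
&\qquad\quad + (y_{ik}-\bar{y}_{i\cdot})(\bar{y}_{i\cdot}-\bar{y}_{N\cdot}) + (\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2\} \\
&\quad = E\{(y_{ij}-\bar{y}_{i\cdot})(y_{ik}-\bar{y}_{i\cdot})\} + E(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2
\end{aligned}
\tag{14}
$$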


since the expected value of the two middle terms is clearly zero. 


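To evaluate the first term of (14), we first work out the expectation
for a given $i$. We have


$$
\begin{aligned}
E\{(y_{ij}-\bar{y}_{i\cdot})(y_{ik}-\bar{y}_{i\cdot}) \mid i\}
&= \frac{1}{M(M-1)}\sum_{j \neq k}(y_{ij}-\bar{y}_{i\cdot})(y_{ik}-\bar{y}_{i\cdot}) \\
&= \frac{1}{M(M-1)}\left[\left\{\sum_{j=1}^{M}(y_{ij}-\bar{y}_{i\cdot})\right\}^2 - \sum_{j=1}^{M}(y_{ij}-\bar{y}_{i\cdot})^2\right] \\
&= \frac{1}{M(M-1)}\left\{0 - (M-1)\,S_i^2\right\} = -\frac{S_i^2}{M}
\end{aligned}
\tag{15}
$$


Next, taking the expectations for varying $i$, we obtain


$$E\{(y_{ij}-\bar{y}_{i\cdot})(y_{ik}-\bar{y}_{i\cdot})\} = -\frac{1}{N}\sum_{i=1}^{N}\frac{S_i^2}{M} = -\frac{S_w^2}{M} \tag{16}$$

where

$$S_i^2 = \frac{1}{M-1}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{i\cdot})^2$$


The values of the second term in (14) and that of the denominator
in (13) are known, by definition, to be $(N-1)S_b^2/N$ and
$(NM-1)S^2/NM$ respectively. We thus have


$$\rho = \frac{(N-1)MS_b^2 - NS_w^2}{(NM-1)S^2} \tag{17}$$


When clusters are randomly formed, the expected value of $S_w^2$
and $MS_b^2$ will each be equal to $S^2$, and


$$\rho = -\frac{1}{NM-1} \tag{18}$$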


Now, by definition, $S^2$, $S_b^2$ and $S_w^2$ are related to each other by
the identity

$$(NM-1)S^2 = (N-1)MS_b^2 + N(M-1)S_w^2 \tag{19}$$


Hence, eliminating first $S_w^2$ from (17) and (19), we have


$$S_b^2 = \frac{NM-1}{M^2(N-1)}\,S^2\,\{1+(M-1)\rho\} \tag{20}$$


and next eliminating $S_b^2$ from the same equations, we get


$$S_w^2 = \frac{NM-1}{NM}\,S^2\,(1-\rho) \tag{21}$$


The variance of the mean of $n$ cluster means is now obtained by
substituting from (20) in (5). We have


$$V(\bar{y}_{n\cdot}) = \frac{N-n}{N}\cdot\frac{S_b^2}{n} = \frac{N-n}{N}\cdot\frac{NM-1}{M(N-1)}\cdot\frac{S^2}{nM}\,\{1+(M-1)\rho\} \tag{22}$$

and from (10) the relative efficiency is given by


$$E = \frac{M(N-1)}{NM-1}\cdot\frac{1}{1+(M-1)\rho} \tag{23}$$


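For $N$ large, the formulae (17) and (19) to (23) can be approximated
by the simpler expressions


$$\rho = \frac{S_b^2 - S_w^2/M}{S^2} \tag{24}$$

$$S^2 = S_b^2 + \frac{M-1}{M}\,S_w^2 \tag{25}$$

$$S_b^2 = \frac{S^2}{M}\,\{1+(M-1)\rho\} \tag{26}$$

$$S_w^2 = S^2\,(1-\rho) \tag{27}$$

$$V(\bar{y}_{n\cdot}) = \frac{N-n}{N}\cdot\frac{S^2}{nM}\,\{1+(M-1)\rho\} \tag{28}$$

and

$$E = \frac{1}{1+(M-1)\rho} \tag{29}$$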


Formula (22) for the variance of the mean of $n$ cluster means
can be expressed more simply by introducing an alternative notation.
Let the variance of the mean of a single cluster be denoted
by $\sigma_b^2$, so that

$$\sigma_b^2 = \frac{N-1}{N}\,S_b^2 \tag{30}$$

that of a single element in a specified cluster by $\sigma_w^2$, so that

$$\sigma_w^2 = \frac{M-1}{M}\,S_w^2 \tag{31}$$

and that of a single element chosen from the population by $\sigma^2$,
so that

$$\sigma^2 = \frac{NM-1}{NM}\,S^2 \tag{32}$$


On substituting for $S_b^2$, $S_w^2$ and $S^2$ from (30), (31) and (32) in
(17), (19), (20), (21) and (22), we have


$$\rho = \frac{\sigma_b^2 - \sigma_w^2/(M-1)}{\sigma^2} \tag{33}$$

$$\sigma^2 = \sigma_b^2 + \sigma_w^2 \tag{34}$$

$$\sigma_b^2 = \frac{\sigma^2}{M}\,\{1+(M-1)\rho\} \tag{35}$$

$$\sigma_w^2 = \frac{M-1}{M}\,\sigma^2\,(1-\rho) \tag{36}$$

and

$$V(\bar{y}_{n\cdot}) = \frac{N-n}{N-1}\cdot\frac{\sigma^2}{nM}\,\{1+(M-1)\rho\} \tag{37}$$


The formula for the variance of the mean of $n$ cluster means
in this form was first developed by Hansen and Hurwitz (1942).
It is made up of three factors. The first factor is the finite
multiplier $(N-n)/(N-1)$. The second factor is the variance of
the mean based on $nM$ elements selected with replacement at
random. The third factor measures the contribution to the
variance of cluster sampling. If $M = 1$, this factor is unity and
we are left with the finite multiplier and the sampling variance
of the mean per element based on $nM$ elements selected independently
at random. If $M$ is greater than 1, $(M-1)\rho$ will,
therefore, measure the relative change in the sampling variance
brought about by sampling clusters instead of elements. In
practice, $\rho$ is usually positive and decreases as $M$ increases, but the
rate of decrease is small relative to the rate of increase in $M$, so
that ordinarily increase in the size of a cluster leads to substantial
increase in the sampling variance of the sample estimate.


The point can be well illustrated with the help of data given
in Table 6.3. Table 6.3 gives values of $\rho$ and $(M-1)\rho$
calculated from the data in Table 6.2. It will be seen that $\rho$
decreases as $M$ increases as expected, but the rate of decrease
in $\rho$ is slow as compared to the rate of increase in $M$. For
clusters of size 16, the relative increase in sampling variance is
seen to be a little over 200%.


TABLE 6.3
Relative Change in Variance with Increase in Size of Cluster

Size of Cluster ($M$)       2       4       8      16
$\rho$                   0.28    0.22    0.18    0.14
$(M-1)\rho$              0.28    0.66    1.26    2.10
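
The values of $\rho$ in Table 6.3 can be recovered from the mean squares of Table 6.2. The sketch below assumes the large-$N$ relation $E = 1/\{1+(M-1)\rho\}$ and solves it for $\rho$; the function name is hypothetical. Table 6.3 appears to form $(M-1)\rho$ from the rounded value of $\rho$, so the unrounded product printed here can differ in the second decimal place.

```python
# Mean squares in (acres)^2, as given in Table 6.2.
BETWEEN_CLUSTERS = {2: 138.6, 4: 180.7, 8: 245.1, 16: 333.9}   # M * S_b^2
TOTAL_MEAN_SQUARE = 108.3                                       # S^2


def intraclass_correlation(M, ms_between, ms_total):
    """Solve E = 1 / (1 + (M - 1) * rho) for rho, with E = ms_total / ms_between."""
    return (ms_between / ms_total - 1.0) / (M - 1)


if __name__ == "__main__":
    for M, ms in BETWEEN_CLUSTERS.items():
        rho = intraclass_correlation(M, ms, TOTAL_MEAN_SQUARE)
        print(f"M={M:2d}  rho={rho:.2f}  (M-1)rho={(M - 1) * rho:.2f}")
```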


Although $\rho$ is usually positive, there are situations where it can
also be negative, as, for example, in the problem of estimating 
the proportion of males in the population using the household 
as a sampling unit. The intra-class correlation between sexes of 
different members of a household is clearly negative. Consider, 
for instance, households of size 4, consisting of a husband, 
a wife and two children. The households can be classified into 
three classes: those in which both children are males, those in 
which one child is a male and the other is a female, and finally, 
those in which both children are females. On the assumption 
that the proportion of male children is one-half, we would expect 
both children to be males in 25% of the households, one child 
to be male in 50% of the households, and no child to be male 
in the remaining 25%. These relative frequencies $p_i$ $(i = 1, 2, 3)$,
together with the values of the proportion of males in the
different classes, viz., $\bar{y}_{i\cdot}$, are presented in Table 6.4. In this
table $y_{ij} = 1$, if the $j$-th member in the $i$-th class is a male, and
$y_{ij} = 0$, if it is a female.


TABLE 6.4

Relative Frequencies and Proportions of Males
in Households of Size 4

Class   Description of     Frequency   Proportion of            $\frac{1}{4}\sum_j (y_{ij}-\bar{y}_{N\cdot})^2$   $\frac{1}{4}\sum_j (y_{ij}-\bar{y}_{i\cdot})^2$   $(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2$
        Household          $p_i$       Males $\bar{y}_{i\cdot}$
1       2 male children    1/4         3/4                      1/4                                               3/16                                              1/16
2       1 male child       1/2         1/2                      1/4                                               1/4                                               0
3       No male children   1/4         1/4                      1/4                                               3/16                                              1/16
Total population           1           1/2                      1/4                                               7/32                                              1/32

Since the population under consideration is an infinite population,
we write


$$S^2 = \sigma^2 = E\{y_{ij} - E(y_{ij})\}^2 = \sum_{i=1}^{3} p_i\,\frac{1}{4}\sum_{j=1}^{4}(y_{ij}-\bar{y}_{N\cdot})^2 = \frac{1}{4} \tag{38}$$

Also

$$\sigma_w^2 = \sum_{i=1}^{3} p_i\,\frac{1}{4}\sum_{j=1}^{4}(y_{ij}-\bar{y}_{i\cdot})^2 = \frac{7}{32} \tag{39}$$

and

$$S_b^2 = \sigma_b^2 = \sum_{i=1}^{3} p_i\,(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2 = \frac{1}{32} \tag{40}$$


The values of $\frac{1}{4}\sum_j (y_{ij}-\bar{y}_{N\cdot})^2$, $\frac{1}{4}\sum_j (y_{ij}-\bar{y}_{i\cdot})^2$ and $(\bar{y}_{i\cdot}-\bar{y}_{N\cdot})^2$ for
the different classes, and those for the whole population representing
$\sigma^2$, $\sigma_w^2$ and $\sigma_b^2$ are also shown in Table 6.4. On substituting
in (33), we obtain

$$\rho = \frac{\dfrac{1}{32} - \dfrac{1}{3}\cdot\dfrac{7}{32}}{\dfrac{1}{4}} = -\frac{1}{6}$$


The result is otherwise obvious also. For, the correlation between
the sexes of husband and wife is $-1$, but that for every other
of the remaining five pairs is zero, since the sex of husband or
wife will not determine the sex of their children, nor will the sex
of one child determine the sex of another. The average value
of the correlation between sexes of different members in a household
of 4 consisting of a husband, a wife and two children is,
therefore, $-1/6$.
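
The household example can be worked through in exact arithmetic. The sketch below is illustrative: it assumes the three household classes and relative frequencies described in the text (a husband, a wife and two children, with each child male with probability one-half), and the names used are hypothetical. The intra-class correlation is computed as $\{\sigma_b^2 - \sigma_w^2/(M-1)\}/\sigma^2$.

```python
from fractions import Fraction as F

# (relative frequency, sexes of husband, wife, child 1, child 2); 1 = male
CLASSES = [
    (F(1, 4), [1, 0, 1, 1]),   # two male children
    (F(1, 2), [1, 0, 1, 0]),   # one male child
    (F(1, 4), [1, 0, 0, 0]),   # no male children
]
HOUSEHOLD_SIZE = 4


def variance_components(classes):
    """Return (sigma^2, sigma_w^2, sigma_b^2) for an infinite population."""
    means = [F(sum(y), len(y)) for _, y in classes]
    overall = sum(p * m for (p, _), m in zip(classes, means))
    sigma2 = sum(p * sum((v - overall) ** 2 for v in y) / len(y)
                 for p, y in classes)
    sigma_w2 = sum(p * sum((v - m) ** 2 for v in y) / len(y)
                   for (p, y), m in zip(classes, means))
    sigma_b2 = sum(p * (m - overall) ** 2
                   for (p, _), m in zip(classes, means))
    return sigma2, sigma_w2, sigma_b2


def intraclass_correlation(sigma2, sigma_w2, sigma_b2, M):
    """rho = (sigma_b^2 - sigma_w^2 / (M - 1)) / sigma^2."""
    return (sigma_b2 - sigma_w2 / (M - 1)) / sigma2


if __name__ == "__main__":
    s2, sw2, sb2 = variance_components(CLASSES)
    rho = intraclass_correlation(s2, sw2, sb2, HOUSEHOLD_SIZE)
    print(s2, sw2, sb2, rho, (HOUSEHOLD_SIZE - 1) * rho)
```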


The relative change in variance in adopting the household as 
a unit of sampling in place of an individual is, therefore, 
$(M-1)\rho$ or $-50\%$. In other words, the household will be
approximately twice as efficient as a single person for estimating 
the sex ratio. It is, of course, recognized that households will not 
all be of the same size and composition, but results from actual 
observations show that a household will be considerably more 
efficient as a unit of sampling than a single individual for the 
character under consideration. The same situation may hold good 
for characters like the proportion of persons in a family above 
a certain age. In general, however, p will be positive and we may 
expect the variance to increase with the size of the cluster. 


6a.4 Estimation from the Sample of the Efficiency of Cluster 
Sampling 


Data for the complete population are seldom available in 
practice. What is available is only a sample of clusters and the 
analysis of variance of the elements in the sample. The problem 
arising in practice is, therefore, to assess the relative efficiency of 
cluster sampling from the sample data alone. 

Let the sample consist of $n$ clusters. Then the analysis of
variance for the sample will take the form of Table 6.5.


TABLE 6.5
Analysis of Variance for Sample

Source of Variation    Degrees of Freedom    Mean Square
Between clusters       $n-1$                 $\dfrac{M}{n-1}\sum_{i=1}^{n}(\bar{y}_{i\cdot}-\bar{y}_{n\cdot})^2 = Ms_b^2$
Within clusters        $n(M-1)$              $\dfrac{1}{n(M-1)}\sum_{i=1}^{n}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{i\cdot})^2 = s_w^2$
Total sample           $nM-1$                $\dfrac{1}{nM-1}\sum_{i=1}^{n}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{n\cdot})^2 = s^2$

In a random sample of clusters, $s_b^2$ and $s_w^2$ will provide
unbiased estimates of the corresponding values in the population,
viz., $S_b^2$ and $S_w^2$. $s^2$ will not, however, be an unbiased estimate
of $S^2$, since the elements on which it is based cannot be considered
to be a simple random sample of elements from the
population of $NM$ units. An unbiased estimate of $S^2$ is, however,
easily obtained from (19) by substituting for $S_b^2$ and $S_w^2$ the
values $s_b^2$ and $s_w^2$. We thus have


$$\text{Est.}\,(S^2) = \frac{(N-1)Ms_b^2 + N(M-1)s_w^2}{NM-1} \tag{41}$$


Hence substituting from (41) for Est. $(S^2)$ and writing $s_b^2$ for $S_b^2$
in (10), we obtain


$$\text{Est. (Relative Efficiency)} = \frac{(N-1)Ms_b^2 + N(M-1)s_w^2}{(NM-1)\,Ms_b^2} \tag{42}$$

$$\approx \frac{s_b^2 + \dfrac{M-1}{M}\,s_w^2}{Ms_b^2}$$

for large $N$.


Example 6.2


Table 6.6 gives the analysis of variance of the area under
wheat for a sample of 44 clusters selected from 11 different villages
in the Meerut District (India). Four clusters were selected from
each of the 11 villages and each cluster consisted of 8 consecutive
survey numbers. Estimate the relative efficiency of the
cluster as the unit of sampling compared with the individual survey
number. The number of clusters in a village may be
assumed to be large.


TABLE 6.6
Analysis of Variance of Area under Wheat

Source of Variation                          Degrees of Freedom    Mean Square
Between villages                                     10             290.1
Between clusters within villages                     33             251.4 = $Ms_b^2$
Between survey numbers within clusters              308             112.8 = $s_w^2$
Total                                               351


On substituting the values from Table 6.6 in (41),
for large $N$, we obtain


$$\text{Est.}\,S^2 \approx s_b^2 + \frac{M-1}{M}\,s_w^2 = \frac{251.4}{8} + \frac{7}{8}\,(112.8) = 130.1$$

Hence


$$\text{Est. (Relative Efficiency)} = \frac{130.1}{251.4} = 0.52$$
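
The arithmetic of Example 6.2 can be sketched as follows. The function name is hypothetical, and the large-$N$ forms are assumed, i.e. Est. $S^2 \approx s_b^2 + \{(M-1)/M\}\,s_w^2$ and the estimated efficiency is that quantity divided by $Ms_b^2$.

```python
def estimated_efficiency_large_N(ms_between, ms_within, M):
    """Estimate S^2 and the relative efficiency of the cluster when N is large.

    ms_between is M * s_b^2 and ms_within is s_w^2, the two sample mean
    squares of the analysis of variance.
    """
    s_b2 = ms_between / M
    est_S2 = s_b2 + (M - 1) / M * ms_within
    return est_S2, est_S2 / ms_between


# Mean squares from Table 6.6: clusters of 8 consecutive survey numbers.
est_S2, efficiency = estimated_efficiency_large_N(251.4, 112.8, 8)
print(f"Est. S^2 = {est_S2:.1f}, Est. relative efficiency = {efficiency:.2f}")
```

The between-villages mean square is not used, since clusters were selected within villages and the comparison is made within villages.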


6a.5 Relationship between the Variance of the Mean of a Single
Cluster and its Size


So far we have considered the relative efficiency of clusters and
elements as sampling units. The more general problem is that of
predicting the efficiency of clusters of any size, given the efficiency
of clusters of a particular size. This requires a knowledge of
the relationship between the mean square between means of
clusters of given size, $S_b^2$, and $M$. Several attempts have been
made to work out such a relationship. The first one was due to
Fairfield Smith (1938).


He argued that: If the cluster were to consist of a random
sample of elements, $S_b^2$ would be equal to $S^2/M$.
Owing to the fact, however, that for most populations encountered
in practice elements of a cluster will be positively correlated,
clusters will differ in their average values more than when they
are composed of randomly selected elements. $S_b^2$ will, therefore,
ordinarily exceed $S^2/M$.
He proposed the following relationship:


$$S_b^2 = \frac{S^2}{M^g} \tag{43}$$

where $g$ is a constant, less than 1, to be calculated from the
sample. He found the relationship to be satisfactory on yield
data from uniformity trials for different size plots.


Example 6.3 


Table 6.7 shows the values of the mean squares between plots 
within fields for plots of five different sizes, viz., equilateral 
triangles of sides (a) 33', (b) 25', (c) 15', (d) 10' and (e) 5' each.
The data relate to the crop-cutting survey on wheat conducted 
in Kangra District (India) during the year 1945-46. Altogether 
76 fields were selected for the survey and in each, 10 plots, two 
of each of the above sizes, were marked at random. 


TABLE 6.7
Yield Survey on Wheat, 1945-46 (Kangra)

Values of Mean Squares between Plots within Fields
for Plots of Different Sizes

                   Mean Square between Plots
Size of Plot       within Fields (Md./Acre)²
    $M$            Observed        Fitted
   471.5             0.51           0.56
   270.6             0.83           0.75
    97.4             1.21           1.26
    43.3             2.14           1.91
    10.8             3.63           3.88


Examine whether the linear relationship

$$\log S_b^2 = \log S^2 - g \log M$$

proposed by Fairfield Smith describes the data adequately.


The equation obtained by fitting the linear curve by the method
of least squares to the data observed is given by


$$\log S_b^2 = 1.117 - 0.511 \log M$$

The theoretical values of $S_b^2$ as calculated from this equation
are given in col. 3 of Table 6.7. It will be seen that the fit
is satisfactory, showing that Fairfield Smith's law adequately
describes the data.
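
The least-squares fit of Example 6.3 can be reproduced with an ordinary straight-line regression on the logarithms. The sketch below is a minimal version under the assumption that the logarithms are to base 10, which is what the fitted values in Table 6.7 imply; the function name is hypothetical.

```python
import math

# Plot sizes M and observed mean squares between plots within fields (Table 6.7).
PLOT_SIZES = [471.5, 270.6, 97.4, 43.3, 10.8]
OBSERVED = [0.51, 0.83, 1.21, 2.14, 3.63]


def fit_log_log(xs, ys):
    """Least-squares line log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


if __name__ == "__main__":
    intercept, slope = fit_log_log(PLOT_SIZES, OBSERVED)
    print(f"log S_b^2 = {intercept:.3f} + ({slope:.3f}) log M")
    for M, obs in zip(PLOT_SIZES, OBSERVED):
        fitted = 10 ** (intercept + slope * math.log10(M))
        print(f"M={M:6.1f}  observed={obs:.2f}  fitted={fitted:.2f}")
```

In this notation the slope is $-g$ and the intercept is $\log S^2$.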


Fairfield Smith's law, however, leads to one logical difficulty.
On the assumption that the total mean square between elements
in the population is known and the mean square between means
(per element) in clusters of size $M$ is given by the relationship
proposed above, an expression for the within cluster mean square
can be derived directly from (19). We get

$$S_w^2 = \frac{(NM-1)S^2 - (N-1)M\,\dfrac{S^2}{M^g}}{N(M-1)} \tag{44}$$


Equation (44) shows that variability within clusters is a function
of $N$, the size of the finite population, although strictly it should
have been independent of it. For $N$ large, however, $S_w^2$ becomes
independent of $N$, being given by


$$S_w^2 = \frac{M}{M-1}\,S^2\,(1 - M^{-g}) \tag{45}$$


where $S^2$ will now represent the total mean square in the infinite
population of which the finite population is itself a sample.
Equation (45) also shows that if we regard the population itself
as a single cluster and $M$ is consequently very large, the within
cluster variance $S_w^2$ will approach $S^2$ as expected.


If instead of assuming the relationship given by (43) we assume
the one given by (45) which satisfies the condition of $S_w^2$ being
independent of the size of the population, the expression for the
mean square between clusters can be written as follows:


$$
\begin{aligned}
S_b^2 &= \frac{1}{M(N-1)}\left\{(NM-1)S^2 - N(M-1)\,\frac{M}{M-1}\,S^2\,(1-M^{-g})\right\} \\
&= \frac{S^2}{M(N-1)}\,(NM^{1-g} - 1)
\end{aligned}
\tag{46}
$$

which in the limit tends to $S^2/M^g$. $S_b^2$ now depends on $N$ and
in fact increases with it when $g < 1$, which is logical, since the
variance of cluster means for clusters of a given size when
clusters are widely separated should be larger than the variance
observed when they are close together.


Jessen (1942) showed that although Fairfield Smith's original
formula, namely (43), describes the yield data extremely well and
the refinement suggested by him, namely (46), even improves the
relationship, most economic characters relating to farm data
follow a slightly different law. He postulated that the mean
square among elements within a cluster is a monotone increasing
function of the size of the cluster given by


$$S_w^2 = aM^b \qquad (b > 0) \tag{47}$$


where $a$ and $b$ are constants to be evaluated from the data.
The same relationship was independently suggested by Mahalanobis
(1940). Consequently, assuming this law to hold for the mean
square within clusters, the expression for the mean square between
cluster means is obtained as shown below:


$$S_b^2 = \frac{(NM-1)S^2 - N(M-1)\,aM^b}{M(N-1)} \tag{48}$$


The constants $S^2$, $a$ and $b$ are evaluated from the data. For
this purpose, we require: (1) an estimate of the mean square
among elements in the population, and (2) an estimate of the
mean squares between elements within clusters for at least two
values of $M$. If we regard the total population as a single cluster
containing $NM$ elements so that


$$S^2 = a\,(NM)^b$$

then we have


$$S_b^2 = \frac{(NM-1)\,a\,(NM)^b - N(M-1)\,aM^b}{M(N-1)} \tag{49}$$

The above relationship now depends on only two constants and
can, therefore, be estimated from the variance among elements
in the population and the variance within clusters for any one
value of $M$. Hendricks (1944), however, pointed out that the law
may not hold good for large sizes of clusters. This was also the
finding of Asthana (1950), who has fitted Jessen's law to describe
the mean square within clusters for acreage under wheat for
a large number of villages. He found that the observed value
of $S_w^2$ when the sampling unit is formed by the whole of the
population (village) was consistently, though only slightly, below
the fitted line showing that the law probably did not hold good
for large sizes of clusters. When the law was fitted to the within
cluster mean squares corresponding to sizes 2, 4, 8 and 16 only,
that is excluding the cluster formed by the whole population, he
found that the fit improved.


Example 6.4


Fit Jessen's law to the within cluster mean square values for
clusters of sizes 2, 4, 8 and 16 survey numbers and the one
formed by all the survey numbers in the village, given in Table
6.8.


TABLE 6.8
Values of $S_w^2$ in Clusters of Different Sizes (Acres)²

    $M$            Observed Values    Fitted Values
     2                  78.10             81.53
     4                  84.28             84.25
     8                  88.92             87.05
    16                  93.50             89.95
$NM$ = 1176            108.33            110.22


The clusters were formed by grouping consecutive survey numbers
in a village. The character under study was the area under wheat.
Jessen's law as fitted to these values will be found to be given by


$$\log S_w^2 = 1.897 + 0.0473 \log M$$

The theoretical values as calculated from this law are given
beside the observed values in Table 6.8. It will be seen that
the law adequately describes the data.
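
The fitted column of Table 6.8 can be checked by evaluating the quoted line at each cluster size. The sketch below assumes base-10 logarithms and uses the rounded coefficients 1.897 and 0.0473 from the text, so the values it gives agree with the table only to within rounding; the function name is hypothetical.

```python
import math

LOG_A, B = 1.897, 0.0473   # rounded coefficients of the fitted line

# Observed within-cluster mean squares in (acres)^2 from Table 6.8.
OBSERVED = {2: 78.10, 4: 84.28, 8: 88.92, 16: 93.50, 1176: 108.33}


def jessen_within_ms(M, log_a=LOG_A, b=B):
    """Within-cluster mean square a * M**b, with a supplied as log10(a)."""
    return 10 ** (log_a + b * math.log10(M))


if __name__ == "__main__":
    for M, obs in OBSERVED.items():
        print(f"M={M:5d}  observed={obs:7.2f}  fitted={jessen_within_ms(M):7.2f}")
```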
6a.6 Optimum Unit of Sampling and Multipurpose Surveys


We have seen that ordinarily cluster sampling will lead to loss
of precision. On the other hand, cluster sampling helps to reduce
the cost of a survey. In this section we shall consider the problem
of determining the optimum size of cluster which will provide the
maximum information for the funds available.

We shall assume that clusters are equal in size. The cost of a
survey based on a sample of $n$ clusters will, apart from overhead
costs on planning and analysis, be made up of: (a) costs due to the
time spent on enumerating all the elements in the sample, $nM$ in
number, including the time spent on travelling from one element to
another within clusters and costs on transport within clusters, and
(b) costs due to the time spent on travelling between clusters and the
cost of transportation between clusters. The cost of a survey can,
therefore, be expressed as a sum of two components one of which is
proportional to the number of elements in the sample and the other
proportional to the distance to be travelled between clusters, i.e.,


$$C = c_1 nM + c_2 d$$


where $c_1$ represents the cost of enumerating an element including
the travel cost from one element to another within the cluster,
$c_2$ that of travelling a unit distance between clusters and $d$ the
distance between clusters. It has been shown empirically that
the minimum distance between $n$ points located at random is
proportional to $n^{1/2} - n^{-1/2}$ (Mahalanobis, 1940). Jessen by means
of experimental work has shown that the approximation $n^{1/2}$ works
well in practice (1942). The equation for the cost of a survey
can, therefore, be expressed as


$$C = c_1 nM + c_2 n^{1/2} \tag{50}$$

where $c_2$ will now be proportional to the cost of travelling a unit
distance.

We have already seen that if the variance within clusters is
assumed to follow Jessen's law, then the variance of the estimated
mean per element based on a sample of $n$ clusters of size $M$
each is given by


$$V(\bar{y}_{n\cdot}) = \frac{N-n}{N}\cdot\frac{S_b^2}{n} \tag{51}$$

where $S_b^2$ is as defined in (48). Substituting in (51) for $S_b^2$ the
value from (48), we have


$$V(\bar{y}_{n\cdot}) = \frac{1}{n}\left\{S^2 - (M-1)\,aM^{b-1}\right\} \tag{52}$$

where the finite multiplier is ignored. The problem is to choose
$n$ and $M$ so that the variance given by (52) is minimized for
a specified cost. We will give the solution as it was first presented
by Jessen and then attempt an algebraic solution.
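
The kind of tabulation described next can be sketched in code. This is a hedged illustration, not Jessen's computation: it assumes a cost function of the form $C = c_1 nM + c_2 n^{1/2}$ and a variance of the form $\{S^2 - (M-1)aM^{b-1}\}/n$ with the finite multiplier ignored, and every numerical value in the usage note is hypothetical. For each candidate cluster size, the number of clusters the budget allows is found by solving the cost equation as a quadratic in $n^{1/2}$, and the corresponding variance is then evaluated.

```python
import math


def clusters_affordable(C, c1, c2, M):
    """Number of clusters n (as a real number) with C = c1*n*M + c2*sqrt(n)."""
    # Quadratic in t = sqrt(n): c1*M*t^2 + c2*t - C = 0; take the positive root.
    t = (-c2 + math.sqrt(c2 ** 2 + 4.0 * c1 * M * C)) / (2.0 * c1 * M)
    return t ** 2


def variance_of_mean(n, M, S2, a, b):
    """Variance {S^2 - (M - 1) * a * M**(b - 1)} / n, finite multiplier ignored."""
    return (S2 - (M - 1) * a * M ** (b - 1)) / n


def best_cluster_size(C, c1, c2, S2, a, b, sizes):
    """Return the size in `sizes` with the smallest variance, and all variances."""
    variances = {}
    for M in sizes:
        n = clusters_affordable(C, c1, c2, M)
        variances[M] = variance_of_mean(n, M, S2, a, b)
    return min(variances, key=variances.get), variances
```

As a usage example, `best_cluster_size(1000, 1, 10, 108.33, 10 ** 1.897, 0.0473, [1, 2, 4, 8, 16])` borrows the within-cluster constants of Example 6.4 purely for illustration, with a made-up budget and made-up unit costs. Raising `c2` relative to `c1` in such a run shifts the balance towards larger clusters, which is the qualitative behaviour discussed in the text.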


The investigation carried out by Jessen related to a survey
of farm facts concerning number of livestock of different types,
acreage and yield of corn and oats and income and expenditure
on the farm. Samples of seven different sampling units were
taken. The sampling units considered were (1) the individual
farm, (2) quarter section, $S_{1/4}$, (3) half section, $S_{1/2}$, (4) full section,
$S$, (5) two adjacent sections, $2S$, (6) 4 adjacent sections, $4S$ and
(7) 36 sections, $36S$. Using the equation of the cost function,
Jessen calculated for each of the different sampling units the
total number of clusters $n$ which could be covered for different
combinations of two different levels of the total expenditure,
three different values of $c_1$ and two different values of $c_2$. The
two levels of the total expenditure specified were \$1000 and
\$2000 and the values of $c_1$ assumed were proportional to 1, 4
and 8 and those of $c_2$ to 1 and 2.5. Table 6.9, reproduced
from Jessen's paper (1942), shows the numbers of sampling units
which could be covered under each of the several cost situations.
Substituting the value of $n$ thus obtained in the equation for the
variance, namely (52), he calculated for each of the 7 different
sampling units the percentage standard errors of each of the
18 items under study. The results relating to the cost situation
in which the total expenditure was fixed at \$1000 and $c_1$ is
proportional to 1 and $c_2$ proportional to 1 are reproduced in
Table 6.10. It will be seen that for all except two characters,
viz., number of sheep and number of eggs, the quarter
section has yielded about the lowest standard error showing
that, for the specified budget and $c_1$ and $c_2$ as given, $S_{1/4}$ is
the optimum size of the sampling unit for the collection of
farm facts.

Jessen made similar calculations for different cost situations.
Results obtained on six sampling units $S_{1/4}$, $S_{1/2}$, $S$, $2S$, $4S$ and $36S$
are summarised in Table 6.11. Thus, row 1 of the table shows
that when $C$ is equal to \$1000 and $c_1 = 1$ and $c_2 = 1$, for
10 out of 18 characters under study, the half section $S_{1/2}$ will be the
optimum unit of sampling. Row 3 of the table similarly shows
that for the same total budget and $c_2$ remaining the same, but
$c_1$ increased to 8 times its value, the quarter section $S_{1/4}$ would be
the optimum unit for 16 out of the 18 characters. The result
indicated that the size of the optimum unit of sampling decreases
as $c_1$, the cost of enumeration, increases. This is confirmed by
examination of other parts of the table. Similarly, the table
shows that the increase in $c_2$ calls for the use of larger sampling
units. We also see that a large budget requires small clusters.


From the study of similar results for 1939, Jessen recommended 
the use of the quarter section as the optimum unit of sampling for 
this kind of survey involving the collection of information on 
several items. This is an important finding since it points to the 
possibility of obtaining information on several related items in the 
compass of a single survey using the same optimum unit of 
sampling. This is also the experience in India with regard to the 
use of the village as the unit of sampling in agricultural surveys 
which has the additional advantage of being administratively 
convenient (Sukhatme, 1950; Sukhatme and Panse, 1951). The
degree of accuracy attained necessarily varies from character to 
character in a given sample of clusters of the optimum size, but 
this can be adjusted using different intensities of sampling for 
different groups of items in the questionnaire. Thus, we may 
divide the questionnaire into three parts, information on part 1 to 
be collected from only half the total sample, that on part 2 
from say three-fourths of the sample and on part 3 from the 
entire sample. The items may be so grouped that information 
on all will be available with about the desired precision. Such 
sample surveys which include within their scope the collection 
of information on more than one item are called multipurpose 
surveys. 


We shall now give an algebraic solution of the problem, originally
due to Cochran (1948), of choosing \(M\) and \(n\) so that the
variance given by (52) is minimized for a given value of the
total cost, say \(C = C_0\).

We form the function

\[ \phi = V(\bar{y}_n) + \mu\,(c_1 nM + c_2 n^{1/2} - C_0) \tag{53} \]

where \(\mu\) is the Lagrangian undetermined constant. We next
differentiate with respect to \(n\), \(M\) and \(\mu\) and equate the results
to zero. Thus, on differentiating with respect to \(n\) and noting
that

\[ \frac{\partial V}{\partial n} = -\frac{V}{n} \tag{54} \]

we get

\[ -\frac{V}{n} + \mu\left(c_1 M + \tfrac{1}{2}\,c_2 n^{-1/2}\right) = 0 \tag{55} \]

Similarly, on differentiating with respect to \(M\) and equating the
result to zero, we have

\[ \frac{\partial V}{\partial M} + \mu c_1 n = 0 \tag{56} \]

And finally, differentiating with respect to \(\mu\) and equating to zero,
we have

\[ c_1 nM + c_2 n^{1/2} = C_0 \tag{57} \]

On eliminating \(\mu\) from equations (55) and (56), we obtain

\[ \frac{M}{V}\,\frac{\partial V}{\partial M} = -\frac{1}{1 + \dfrac{c_2}{2 c_1 M n^{1/2}}} \tag{58} \]

Now solving equation (57) as a quadratic in \(n^{1/2}\), we get

\[ n^{1/2} = \frac{-c_2 + (c_2^2 + 4 c_1 C_0 M)^{1/2}}{2 c_1 M} \tag{59} \]

On substituting for \(n^{1/2}\) in (58) and simplifying the algebra, we
obtain

\[ \frac{M}{V}\,\frac{\partial V}{\partial M} = -1 + \left(1 + \frac{4 c_1 C_0 M}{c_2^2}\right)^{-1/2} \tag{60} \]

Now it can be seen from (52) that

\[ \frac{1}{V}\,\frac{\partial V}{\partial M} \]

is independent of \(n\). Equation (60) can, therefore, be solved
directly for \(M\). An explicit expression for \(M\) is, however, difficult
to obtain and the solution has, therefore, to be obtained by the trial
and error method. On substituting the value of \(M\) so obtained
back in (59), we obtain the optimum value of \(n\).
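
The trial and error solution can be sketched numerically. The following is a minimal illustration, not the procedure used by Jessen or Cochran: for each candidate cluster size it takes the number of clusters allowed by the budget from (59), evaluates the variance (52), and keeps the smallest. The function names and all numerical inputs (cost factors, \(S^2\), \(a\), \(b\), the candidate range) are assumptions made only for this example, and \(n\) is treated as continuous.

```python
import math


def clusters_affordable(M, c1, c2, C0):
    """Solve the cost equation (57) for n via (59), given cluster size M."""
    root_n = (-c2 + math.sqrt(c2 ** 2 + 4.0 * c1 * C0 * M)) / (2.0 * c1 * M)
    return root_n ** 2


def variance_52(M, n, S2, a, b):
    """Variance (52): [S^2 - (M - 1) a M^(b - 1)] / n, finite multiplier ignored."""
    return (S2 - (M - 1) * a * M ** (b - 1)) / n


def search_cluster_size(c1, c2, C0, S2, a, b, candidates):
    """Trial and error: evaluate (52) at the n allowed by the budget for each M."""
    best = None
    for M in candidates:
        n = clusters_affordable(M, c1, c2, C0)
        v = variance_52(M, n, S2, a, b)
        if best is None or v < best[0]:
            best = (v, M, n)
    return best


# Hypothetical inputs, chosen only for illustration.
c1, c2, C0 = 1.0, 100.0, 1000.0
S2, a, b = 1.0, 0.6, 0.1
v_opt, M_opt, n_opt = search_cluster_size(c1, c2, C0, S2, a, b, range(1, 31))
print(M_opt, round(n_opt, 1), v_opt)
```

Re-running the search with other values of `c1`, `c2` and `C0` shows how the chosen cluster size responds to the cost factors. The candidate range should be kept to sizes for which the bracket in (52) remains positive.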


Since \(V\) decreases as \(M\) increases, we may expect the left-hand
side of (60) to be approximately constant whatever the value of
\(M\). An examination of (60) also shows that the left-hand side is
independent of the cost factors while the right-hand side involves
\(M\) only in combination with the cost factors. It follows therefore
that \(M\) will respond to the variation in \(c_1\), \(c_2\) and \(C_0\) in such
a way that \(c_1 C_0 M/c_2^2\) is approximately constant. It follows that
\(M\) will be smaller if (1) \(c_1\) increases, i.e., the cost of enumerating
an element increases; (2) \(c_2\) decreases, i.e., travel becomes cheaper;
and (3) \(C_0\) is large, i.e., the amount of money available for the
survey is large. The algebraic solution thus confirms the calculations
deduced from the actual data reproduced in Tables 6.9-6.11.


TABLE 6.9


Numbers of Sampling Units which can be Covered, given
Several Cost Situations, Two Expenditure Levels,
and Seven Different Sampling Units

Unstratified Sample in the State of Iowa

                         No. of      Mileage at 2¢/Mile          Mileage at 5¢/Mile
                         Farms/      Length of Farm Visit        Length of Farm Visit
Sampling Unit            Sampling    15       60       120       15       60       120
                         Unit*       Min.     Min.     Min.      Min.     Min.     Min.

A. Total Expenditure of $1000

Individual farm            1.000     1644     650      371       1088     517      315
Quarter section            0.914     1745     699      401       1140     551      339
Half section               1.828     1073     392      218        764     336      192
Section                    3.656      624     213      116        475     186      105
Two sections               7.312      347     113       60        278     102       56
Four sections             14.624      187      59       31        156      54       29
Thirty-six sections      131.616       21       7        4         17       6        3

B. Total Expenditure of $2000

Individual farm            1.000   402[?]    1452      803       2886    1223      712
Quarter section            0.914     4293    1569      871       3057    1314      769
Half section               1.828     2494     852      462       1900     744      421
Section                    3.656     1388     451      241       1112     407      225
Two sections               7.312      749     235      124        623     217      118
Four sections             14.624      396     121       63        338     113       61
Thirty-six sections      131.616       44      14        7         38      13        7

* Computed from the sample survey data.


TABLE 6.10

Relative Standard Errors (% of Item Means per Farm)
Estimated for Samples of Different Sampling Units
and Taken at Random within the State, 1938

(Expenditure of $1000, 15-minute Questionnaire and 2¢ per Mile)

                                     Sampling Unit
Items

Number of swine                   2.7     2.82    2.74    2.90
Number of horses                  1.83    1.93    1.87    1.98
Number of sheep                   9.61    9.76    8.80    3.16
Number of chickens                1.61    1.70    1.66    1.78
Number of eggs yesterday          3.17    3.21    2.90    2.69
Number of cattle                  2.55    2.67    2.55    2.65
Number of cows milked             1.98    2.07    2.40    2.09
Number of gallons of milk         2.34    2.45    2.32    2.39
Dairy product receipts            2.99    3.11    2.93    2.97
Number of farm acres              1.54    1.63    1.57    1.64
Number of corn acres              1.95    2.06    1.98    2.08
Number of oat acres               2.36    2.59    2.66    3.05
Corn yield                         .82     .90     .94    1.09
Oat yield                          .84     .88     .84     .86
Commercial feed expenditures      6.23    7.06    7.60    9.14
Total expenditures, operator      3.96    4.36    4.51    5.21
Total receipts, operator          3.16    3.49    3.64    4.23
Net cash income, operator         3.54    3.82    3.84    4.26


2S 


NN 
a = 
Е 3 


45 


4-11 
2:80 
7-44 
2:57 
2:45 


365 


9-99 
6-87 
7-44 
6-34 
2:45 


8-66 
6-79 
7-17 
8-55 
5:58 


6:88 
12:76 

4:73 

2-71 
43-07 


22-36 
18:39 
16-82 
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TABLE 6.11 


Summary of Sampling Unit Efficiencies 


Number of Items Most Efficiently Estimated by the 
Six-Grid Sampling Units, 1938 and 1939 


Expenditure, Mileage Sampling Unit 
Rate and Questionnaire 
Length 5, Sa 5 25 45 365 


Expenditure of $1000 
T. 2e/15 min. 1938 .. 6 10 


P 1 1 
1939 |. 6 E i 2 2 
IL. 2/60 min. 1938 .. 13 3 1 
1939 14 2 2 2 
HI. 2е/120 min. 1938 .. 16 1 1 
1939 16 2 2 
IV. 5еј15 min, 1938 | 12 2 1 1 
1939 4 ҒЫ + 2 2 
У." 5/60 min. 1938 6 10 1 
1939 7 81 2 2 
УІ. 5/120 тіп. 1938 .. 114 4 1 
1939: -. 12 a i 2 
Expenditure of $2000 
УП. 22/15 min. 1938 .. 7 9 1 
1939 .. % 8 2 2 
ҮШ. 22/60 min. 1938 .. 16 ! 
1939 15 1 2 2 
IX. 2e[l20 min. 1938 ., 16 1 1 
1939 16 1 2 2 
X. 5/15 min. 1938 .. 5 11 1 1 
1939 6 8 2 2 2 
ХІ. 54/60 min. 1938 .. 1% 3 1 
1939 12 Т 5 2 
ХП. 58/120 тіп. 1938 |, 124 “2 
1939... А ? S E 2 2 


~ 
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B. UNEQUAL CLUSTERS 


6b.1 Estimates of the Mean and their Variances

Let the i-th cluster consist of \(M_i\) elements \((i = 1, 2, \ldots, N)\) and let
\(M_0 = \sum_{i=1}^{N} M_i\) denote the total number of elements and \(\bar{M} = M_0/N\) the
average number of elements per cluster in the population. Then the
mean of the character per element in the i-th cluster will be given by

\[ \bar{y}_{i.} = \frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij} \tag{61} \]

and the mean per element in the population by

\[ \bar{y}_{..} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij}}{\sum_{i=1}^{N} M_i} = \frac{\sum_{i=1}^{N} M_i\,\bar{y}_{i.}}{M_0} \tag{62} \]

Several estimates of the population value of the mean per
element can be formed from a random sample of \(n\) clusters.
We shall first consider the simplest, viz., the simple arithmetic
mean of the cluster means given by

\[ \bar{y}_{n.} = \frac{1}{n}\sum_{i=1}^{n} \bar{y}_{i.} \tag{63} \]

It is easy to see that this estimate will not give an unbiased estimate
of the population value, for

\[ E(\bar{y}_{n.}) = \frac{1}{n}\sum_{i=1}^{n} E(\bar{y}_{i.}) = \frac{1}{N}\sum_{i=1}^{N} \bar{y}_{i.} = \bar{y}_{N.} \neq \bar{y}_{..} \tag{64} \]

unless \(M_i\) and \(\bar{y}_{i.}\) are uncorrelated. It is likely, however, that
for large \(n\) and for a population for which \(M_i\) do not appreciably
vary from one cluster to another, this estimate may not be
materially biased.


Since \(\bar{y}_{n.}\) is a biased estimate, the error in \(\bar{y}_{n.}\) will consist of
two components: one arising from the sampling variations about
its own mean, viz., the unweighted mean of the cluster means
in the population; and the other arising from the bias component.
The expected value of the square of the total error in \(\bar{y}_{n.}\)
is called the mean square error. To evaluate it, we write

\[ \bar{y}_{n.} - \bar{y}_{..} = (\bar{y}_{n.} - \bar{y}_{N.}) + (\bar{y}_{N.} - \bar{y}_{..}) \tag{65} \]

where \(\bar{y}_{N.}\) is the simple arithmetic mean of the cluster means,
given by

\[ \bar{y}_{N.} = \frac{1}{N}\sum_{i=1}^{N} \bar{y}_{i.} \]

Squaring both sides of (65) and taking expectations, we obtain

\[ \text{M.S.E.}(\bar{y}_{n.}) = V(\bar{y}_{n.}) + (\text{bias})^2 = \frac{N-n}{N}\,\frac{S_b^2}{n} + (\bar{y}_{N.} - \bar{y}_{..})^2 \tag{66} \]

where \(S_b^2\) is defined in (6), namely,

\[ S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N} (\bar{y}_{i.} - \bar{y}_{N.})^2 \]

An unbiased estimate can also be formed. The simplest would
be the arithmetic mean based on cluster totals given by

\[ \bar{y}'_{n.} = \frac{1}{n\bar{M}}\sum_{i=1}^{n} M_i\,\bar{y}_{i.} \tag{67} \]

It is easy to see that it is unbiased, for

\[ E(\bar{y}'_{n.}) = \frac{1}{n\bar{M}}\sum_{i=1}^{n} E(M_i\,\bar{y}_{i.}) = \frac{1}{N\bar{M}}\sum_{i=1}^{N} M_i\,\bar{y}_{i.} = \bar{y}_{..} \tag{68} \]

The sampling variance of this estimate can be written as


\[ V(\bar{y}'_{n.}) = \frac{N-n}{N}\,\frac{S_b'^2}{n} \tag{69} \]

where

\[ S_b'^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i\,\bar{y}_{i.}}{\bar{M}} - \bar{y}_{..}\right)^2 \tag{70} \]

It will be seen that the variance of \(\bar{y}'_{n.}\) depends upon the
variation of the product \(M_i\,\bar{y}_{i.}\) and is, therefore, likely to be larger
than that of \(\bar{y}_{n.}\), unless \(M_i\) and \(\bar{y}_{i.}\) vary in such a way that their
product is almost constant.

A third estimate which is biased but consistent is given by

\[ \bar{y}''_{n.} = \frac{\sum_{i=1}^{n} M_i\,\bar{y}_{i.}}{\sum_{i=1}^{n} M_i} \tag{71} \]

It is a weighted mean of the cluster means and is the ratio of
two random variables. We have already seen that this estimate
is a biased estimate, but is consistent, the bias decreasing with
increase in \(n\). A first approximation to the variance of this
estimate is given by replacing \(y_i\) by \(M_i\,\bar{y}_{i.}\) and \(x_i\) by \(M_i\) in equation
(30) of Chapter IV. We obtain

\[ V(\bar{y}''_{n.}) \simeq \frac{N-n}{Nn}\,\frac{1}{N-1}\sum_{i=1}^{N}\frac{M_i^2}{\bar{M}^2}\,(\bar{y}_{i.} - \bar{y}_{..})^2 = \frac{N-n}{Nn}\,S_b''^2 \tag{72} \]

where

\[ S_b''^2 = \frac{1}{N-1}\sum_{i=1}^{N}\frac{M_i^2}{\bar{M}^2}\,(\bar{y}_{i.} - \bar{y}_{..})^2 \tag{73} \]

which, as we have seen, is a satisfactory approximation to the
actual variance provided \(n\) is large.

The variance of the ratio estimate is smaller than that based
on the simple arithmetic mean of the cluster totals provided \(\rho\) is
larger than \(\text{C.V.}(M_i)/\{2\,\text{C.V.}(M_i\,\bar{y}_{i.})\}\) and consequently the estimate
\(\bar{y}''_{n.}\) is expected to be more efficient than \(\bar{y}'_{n.}\) in large samples,
whenever \(M_i\) and \(M_i\,\bar{y}_{i.}\) are highly correlated.
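
The three estimates of this section and their variance expressions can be compared on a small example. The sketch below is only an illustration: the function names and the five-cluster population are hypothetical, and the variance function simply evaluates the mean square error of the mean of cluster means, the variance of the mean of cluster totals, and the first approximation to the variance of the ratio estimate for a population that is assumed to be fully known.

```python
def mean_of_cluster_means(means):
    """Simple arithmetic mean of the sampled cluster means."""
    return sum(means) / len(means)


def mean_of_cluster_totals(sizes, means, M_bar):
    """Cluster totals averaged, then divided by the population mean cluster size."""
    totals = [M * y for M, y in zip(sizes, means)]
    return sum(totals) / (len(totals) * M_bar)


def ratio_estimate(sizes, means):
    """Sum of cluster totals divided by the sum of cluster sizes."""
    return sum(M * y for M, y in zip(sizes, means)) / sum(sizes)


def variance_formulas(pop_sizes, pop_means, n):
    """M.S.E. of the mean of means and variances of the other two estimates."""
    N = len(pop_sizes)
    M_bar = sum(pop_sizes) / N
    y_pop = ratio_estimate(pop_sizes, pop_means)   # population mean per element
    y_N = mean_of_cluster_means(pop_means)         # unweighted mean of cluster means
    f = (N - n) / (N * n)
    S_b2 = sum((y - y_N) ** 2 for y in pop_means) / (N - 1)
    S_b2_totals = sum((M * y / M_bar - y_pop) ** 2
                      for M, y in zip(pop_sizes, pop_means)) / (N - 1)
    S_b2_ratio = sum((M / M_bar) ** 2 * (y - y_pop) ** 2
                     for M, y in zip(pop_sizes, pop_means)) / (N - 1)
    return f * S_b2 + (y_N - y_pop) ** 2, f * S_b2_totals, f * S_b2_ratio


# Hypothetical population of five clusters (sizes and cluster means).
pop_sizes = [2, 4, 6, 8, 10]
pop_means = [10.0, 12.0, 9.0, 11.0, 13.0]
mse_means, var_totals, var_ratio = variance_formulas(pop_sizes, pop_means, n=2)
print(mse_means, var_totals, var_ratio)
```

When the clusters are all of the same size the three estimates coincide, which gives a quick check on the functions.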


6b.2 Probability Proportional to Cluster Size: Estimate of the
Mean and its Variance

The basic theory of sampling with varying probabilities of
selection has been given in Sections 2b.2, 2b.3, 2b.4 and 2b.5
of Chapter II. In this section we shall give its application to
cluster sampling.

Let \(P_i\) denote the probability of selecting the i-th cluster
\((i = 1, 2, \ldots, N)\) at the first draw, \(\sum_{i=1}^{N} P_i\) being 1, and define a
variate \(z\) by

\[ z_{ij} = \frac{M_i\,y_{ij}}{M_0 P_i} \tag{74} \]

whence

\[ \bar{z}_{i.} = \frac{M_i\,\bar{y}_{i.}}{M_0 P_i} \]

Then it is easily shown that the expected value of \(\bar{z}_{i.}\) is equal
to the population mean \(\bar{y}_{..}\). Denoting the expected value of \(\bar{z}_{i.}\)
by \(\bar{z}_{..}\), we have

\[ \bar{z}_{..} = \sum_{i=1}^{N} P_i\,\bar{z}_{i.} = \bar{y}_{..} \tag{75} \]

It follows that

\[ E(\bar{z}_{n.}) = \bar{z}_{..} = \bar{y}_{..} \tag{76} \]

When the selection probability is proportional to the size of the
cluster, in other words, when \(P_i = M_i/M_0\), the variate \(z\) becomes
identical with \(y\), and \(\bar{z}_{n.} = \bar{y}_{n.}\). We thus reach an important
result, that a simple arithmetic mean of the cluster means, under
a system of sampling with probability proportional to size of
cluster, gives an unbiased estimate of the population mean \(\bar{y}_{..}\).

To obtain the sampling variance of \(\bar{z}_{n.}\), we proceed exactly
step by step as in Section 2b.2 and reach the same expression
as in (136), namely,

\[ V(\bar{z}_{n.}) = \frac{\sigma_{bz}^2}{n} \tag{77} \]

where

\[ \sigma_{bz}^2 = \sum_{i=1}^{N} P_i\,(\bar{z}_{i.} - \bar{z}_{..})^2 \tag{78} \]

The estimate of the sampling variance of \(\bar{z}_{n.}\) is also given by
the same expression as shown in (143) in Section 2b.3, namely,

\[ \text{Est. } V(\bar{z}_{n.}) = \frac{s_{bz}^2}{n} \tag{79} \]

where \(s_{bz}^2\) is the mean square between cluster \(\bar{z}_{i.}\)'s in the sample,
defined by

\[ s_{bz}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\bar{z}_{i.} - \bar{z}_{n.})^2 \tag{80} \]

When \(P_i = M_i/M_0\), we have \(\bar{z}_{n.} = \bar{y}_{n.}\),

\[ V(\bar{z}_{n.}) = \frac{\sigma_b^2}{n} \tag{81} \]

\[ \sigma_b^2 = \sum_{i=1}^{N} \frac{M_i}{M_0}\,(\bar{y}_{i.} - \bar{y}_{..})^2 \tag{82} \]

and

\[ \text{Est. } V(\bar{z}_{n.}) = \frac{s_b^2}{n} \tag{83} \]

where

\[ s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\bar{y}_{i.} - \bar{y}_{n.})^2 \tag{84} \]

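6b.3 Probability Proportional to Cluster Size: Efficiency of Cluster
Sampling

It can be shown that the relative change in variance with a
cluster as a unit of sampling in place of an element under the
system of sampling with probability proportional to size of cluster
is given by an expression similar to that in the case of equal
clusters, viz., \((\bar{M} - 1)\rho\). To evaluate \(\rho\), we start from equations
(13) and (14) and write

\[ \rho = \frac{E\{(y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.})\} + E(\bar{y}_{i.} - \bar{y}_{..})^2}{E(y_{ij} - \bar{y}_{..})^2} \tag{85} \]

By definition

\[ E(\bar{y}_{i.} - \bar{y}_{..})^2 = \sigma_b^2 \tag{86} \]

Further, let

\[ E\{(y_{ij} - \bar{y}_{i.})^2 \mid i\} = \sigma_i^2 \tag{87} \]

and

\[ E(y_{ij} - \bar{y}_{..})^2 = \sigma^2 \tag{88} \]

To obtain the expected value of the first term in the numerator
of (85), we use the identity

\[ \Big\{\sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{i.})\Big\}^2 = \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{i.})^2 + \sum_{j \neq k} (y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.}) \]

which can be written as

\[ 0 = M_i\,\sigma_i^2 + M_i (M_i - 1)\,E\{(y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.}) \mid i\} \]

Hence

\[ E\{(y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.}) \mid i\} = -\frac{\sigma_i^2}{M_i - 1} \tag{89} \]

Taking the expectation of the expressions on both sides of (89)
for variation in \(i\), we obtain

\[ E\{(y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.})\} = -\sum_{i=1}^{N} \frac{M_i}{M_0}\,\frac{\sigma_i^2}{M_i - 1} \tag{90} \]

Assuming \(\sigma_i\) to be constant and equal to \(\sigma_w\), we have

\[ E\{(y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.})\} = -\frac{\sigma_w^2}{N\bar{M}}\sum_{i=1}^{N} \frac{M_i}{M_i - 1} \tag{91} \]

Now the expression

\[ \frac{1}{N}\sum_{i=1}^{N} \frac{M_i}{M_i - 1} \]

may be satisfactorily approximated by \(\bar{M}/(\bar{M} - 1)\). We therefore
have

\[ E\{(y_{ij} - \bar{y}_{i.})(y_{ik} - \bar{y}_{i.})\} \simeq -\frac{\sigma_w^2}{\bar{M} - 1} \tag{92} \]

Hence, substituting from (86), (88) and (92) in (85), we have

\[ \rho = \frac{\sigma_b^2 - \dfrac{\sigma_w^2}{\bar{M} - 1}}{\sigma^2} \tag{93} \]

Also

\[ \sigma^2 = \sigma_w^2 + \sigma_b^2 \tag{94} \]

Hence, eliminating \(\sigma_w^2\) from equations (93) and (94), we have

\[ \bar{M}\sigma_b^2 = \sigma^2\{1 + (\bar{M} - 1)\rho\} \tag{95} \]

It follows that the sampling variance of the mean of \(n\) clusters
(on an element basis) selected with probability proportional to
their size is given by

\[ V(\bar{z}_{n.}) = \frac{\sigma_b^2}{n} = \frac{\sigma^2}{n\bar{M}}\{1 + (\bar{M} - 1)\rho\} \tag{96} \]

If a simple random sample of \(n\bar{M}\) elements had been selected
independently with replacement, the variance would have been

\[ V(\bar{y}_{n\bar{M}}) = \frac{\sigma^2}{n\bar{M}} \tag{97} \]

so that the relative change in variance is given by

\[ \frac{V(\bar{z}_{n.}) - V(\bar{y}_{n\bar{M}})}{V(\bar{y}_{n\bar{M}})} = (\bar{M} - 1)\rho \tag{98} \]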
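
As a small numerical companion to this section, the sketch below computes the intraclass correlation from between-cluster and within-cluster variance components, taking \(\rho = \{\sigma_b^2 - \sigma_w^2/(\bar{M} - 1)\}/\sigma^2\) with \(\sigma^2 = \sigma_w^2 + \sigma_b^2\), and then the relative change in variance \((\bar{M} - 1)\rho\). The function names and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def intraclass_rho(sigma_b2, sigma_w2, M_bar):
    """rho from the between and within components and the mean cluster size."""
    sigma2 = sigma_w2 + sigma_b2
    return (sigma_b2 - sigma_w2 / (M_bar - 1)) / sigma2


def relative_change(sigma_b2, sigma_w2, M_bar):
    """Relative change in variance, cluster (pps) versus element sampling."""
    return (M_bar - 1) * intraclass_rho(sigma_b2, sigma_w2, M_bar)


# Hypothetical variance components and mean cluster size.
sigma_b2, sigma_w2, M_bar = 3.0, 5.0, 6.0
rho = intraclass_rho(sigma_b2, sigma_w2, M_bar)
print(rho, relative_change(sigma_b2, sigma_w2, M_bar))
```

With these inputs the identity \(\bar{M}\sigma_b^2 = \sigma^2\{1 + (\bar{M} - 1)\rho\}\) can be verified directly from the printed values.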


6b.4 Probability Proportional to Cluster Size: Relative Efficiency
of Different Estimates

In Section 2b.2 of Chapter II we remarked that the method [...]
sampling of clusters. [...]
comparisons.

Of the three estimates appropriate for simple random sampling,
namely, \(\bar{y}_{n.}\), \(\bar{y}'_{n.}\) and \(\bar{y}''_{n.}\), it is necessary to consider only the
first and the last, since the estimate \(\bar{y}'_{n.}\) will generally be less
efficient than either \(\bar{y}_{n.}\) or \(\bar{y}''_{n.}\). We shall, therefore, make
comparisons under two heads:

(A) that of \(\bar{z}_{n.}\) with \(\bar{y}_{n.}\) and

(B) that of \(\bar{z}_{n.}\) with \(\bar{y}''_{n.}\).

We shall assume that sampling is carried out with replacement.

(A) Comparison of \(\bar{z}_{n.}\) with \(\bar{y}_{n.}\)


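We have seen that \(\bar{z}_{n.}\) is an unbiased estimate of \(\bar{y}_{..}\) with its
variance given by

\[ V(\bar{z}_{n.}) = \frac{1}{nN}\sum_{i=1}^{N} \frac{M_i}{\bar{M}}\,(\bar{y}_{i.} - \bar{y}_{..})^2 \tag{99} \]

The estimate \(\bar{y}_{n.}\), on the other hand, is a biased estimate of the
population value. In comparing it with \(\bar{z}_{n.}\), we have, therefore,
to consider its mean square error which, ignoring the finite
multiplier and replacing \(N - 1\) by \(N\), is given by

\[ \text{M.S.E.}(\bar{y}_{n.}) \simeq \frac{1}{nN}\sum_{i=1}^{N} (\bar{y}_{i.} - \bar{y}_{N.})^2 + (\bar{y}_{N.} - \bar{y}_{..})^2 \tag{100} \]

Using the identity

\[ \sum_{i=1}^{N} (\bar{y}_{i.} - \bar{y}_{..})^2 = \sum_{i=1}^{N} (\bar{y}_{i.} - \bar{y}_{N.})^2 + N(\bar{y}_{N.} - \bar{y}_{..})^2 \]

equation (100) can be rewritten to read

\[ \text{M.S.E.}(\bar{y}_{n.}) \simeq \frac{1}{nN}\sum_{i=1}^{N} (\bar{y}_{i.} - \bar{y}_{..})^2 + \left(1 - \frac{1}{n}\right)(\bar{y}_{N.} - \bar{y}_{..})^2 \tag{101} \]

The difference between (101) and (99) works out to

\[ \text{M.S.E.}(\bar{y}_{n.}) - \text{M.S.E.}(\bar{z}_{n.}) = -\frac{1}{nN}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}} - 1\right)(\bar{y}_{i.} - \bar{y}_{..})^2 + \left(1 - \frac{1}{n}\right)(\bar{y}_{N.} - \bar{y}_{..})^2 \tag{102} \]

In order to examine this difference we shall group together
clusters of the same size and write the first term in (102) as
shown below:

\[ -\frac{1}{nN}\sum_{i=1}^{k}\left(\frac{M_i}{\bar{M}} - 1\right)\sum_{j=1}^{N_i} (\bar{y}_{j.} - \bar{y}_{..})^2 \tag{103} \]

where the summation \(\sum_j\) is taken over the \(N_i\) clusters of size \(M_i\)
each, and the summation \(\sum_i\) is taken over the \(k\) groups, where
\(\sum_{i=1}^{k} N_i = N\). We shall restrict the discussion to the case where
\(E(\bar{y}_{i.} \mid M_i) = \bar{y}_{..}\). Clearly,

\[ \frac{1}{N_i}\sum_{j=1}^{N_i} (\bar{y}_{j.} - \bar{y}_{..})^2 \]

will then represent the variance of \(\bar{y}_{i.}\) and may be denoted by
\(V(\bar{y}_{i.} \mid M_i)\). Equation (103) can now be put in the form

\[ -\frac{1}{nN}\sum_{i=1}^{k} N_i\left(\frac{M_i}{\bar{M}} - 1\right) V(\bar{y}_{i.} \mid M_i) \tag{104} \]

and is seen to depend upon the relationship between the variance
of the mean of a cluster of given size and the size. If the
clusters were randomly formed, the variance will be inversely
proportional to the size of the cluster. Owing to the fact,
however, that the elements of a cluster will be correlated, the
variance will seldom decrease quite as fast. We shall examine
the difference for three special cases:

I. \(V(\bar{y}_{i.} \mid M_i) =\) a constant, say \(\gamma\),

II. \(V(\bar{y}_{i.} \mid M_i) = \gamma/M_i\),

and

III. \(V(\bar{y}_{i.} \mid M_i) = \gamma/M_i^2\).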


Case I. \(V(\bar{y}_{i.} \mid M_i) = \gamma\)

Clearly the value of (104) is zero, giving us

\[ \text{M.S.E.}(\bar{y}_{n.}) - \text{M.S.E.}(\bar{z}_{n.}) = \left(1 - \frac{1}{n}\right)(\bar{y}_{N.} - \bar{y}_{..})^2 \tag{105} \]

which is always positive, so that \(\bar{z}_{n.}\) is expected to be more
precise than the estimate \(\bar{y}_{n.}\) for this case.

Case II. \(V(\bar{y}_{i.} \mid M_i) = \gamma/M_i\)

Equation (104) can be approximated by

\[ -\frac{\gamma}{nN}\sum_{i=1}^{k} \frac{N_i}{M_i}\left(\frac{M_i}{\bar{M}} - 1\right) \simeq \frac{\gamma}{nN\bar{M}}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}} - 1\right)^2 \tag{106} \]

which is again positive, so that for this case also we can expect
\(\bar{z}_{n.}\) to be more efficient than \(\bar{y}_{n.}\).

Case III. \(V(\bar{y}_{i.} \mid M_i) = \gamma/M_i^2\)

It can be shown that, following the approach in Case II,
equation (104) can be approximated by

\[ -\frac{1}{nN}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}} - 1\right)(\bar{y}_{i.} - \bar{y}_{..})^2 \simeq \frac{2\gamma}{nN\bar{M}^2}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}} - 1\right)^2 \tag{107} \]

so that for this case also we would expect \(\bar{z}_{n.}\) to be more efficient
than \(\bar{y}_{n.}\).

These results are in fact obvious from an examination of the
first term in (102). This term, ignoring the sign, represents the
covariance between \(M_i\) and \((\bar{y}_{i.} - \bar{y}_{..})^2\). Now, ordinarily \((\bar{y}_{i.} - \bar{y}_{..})^2\)
will decrease as \(M_i\) increases so that the covariance will be
negative and consequently the expression on the right-hand side
of (102) will be positive.


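(B) Comparison of \(\bar{z}_{n.}\) with \(\bar{y}''_{n.}\)

The mean square error of \(\bar{y}''_{n.}\), when the finite multiplier is
ignored and \(n\) is large, is given by

\[ \text{M.S.E.}(\bar{y}''_{n.}) \simeq \frac{1}{nN}\sum_{i=1}^{N} \frac{M_i^2}{\bar{M}^2}\,(\bar{y}_{i.} - \bar{y}_{..})^2 \tag{108} \]

Deducting from this the M.S.E. in (99), we obtain

\[ \text{M.S.E.}(\bar{y}''_{n.}) - \text{M.S.E.}(\bar{z}_{n.}) = \frac{1}{nN}\sum_{i=1}^{N} \frac{M_i}{\bar{M}}\left(\frac{M_i}{\bar{M}} - 1\right)(\bar{y}_{i.} - \bar{y}_{..})^2 \tag{109} \]

Restricting again the discussion to the case in which \(E(\bar{y}_{i.} \mid M_i) = \bar{y}_{..}\),
we notice that the relative precision of \(\bar{y}''_{n.}\) and \(\bar{z}_{n.}\) depends upon
the relationship between the variance of \(\bar{y}_{i.}\) and \(M_i\).

Case I. Let \(V(\bar{y}_{i.} \mid M_i) = \gamma\)

Equation (109) can then be written as

\[ \text{M.S.E.}(\bar{y}''_{n.}) - \text{M.S.E.}(\bar{z}_{n.}) = \frac{\gamma}{nN}\sum_{i=1}^{N} \frac{M_i}{\bar{M}}\left(\frac{M_i}{\bar{M}} - 1\right) = \frac{\gamma}{nN}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}} - 1\right)^2 \tag{110} \]

It follows that \(\bar{z}_{n.}\) is more efficient than \(\bar{y}''_{n.}\).

Case II. Let \(V(\bar{y}_{i.} \mid M_i) = \gamma/M_i\)

The right-hand side of (109) reduces to zero, showing that the
two estimates are of equal precision.

Case III. Let \(V(\bar{y}_{i.} \mid M_i) = \gamma/M_i^2\)

Equation (109) then takes the form

\[ \text{M.S.E.}(\bar{y}''_{n.}) - \text{M.S.E.}(\bar{z}_{n.}) = \frac{\gamma}{nN\bar{M}}\sum_{i=1}^{N}\left(\frac{1}{\bar{M}} - \frac{1}{M_i}\right) \simeq -\frac{\gamma}{nN\bar{M}^2}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}} - 1\right)^2 \tag{111} \]

Thus for this case, in contrast to the previous two, the estimate
\(\bar{z}_{n.}\) is expected to be less efficient than the ratio estimate in
simple random sampling.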
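
The signs obtained in the three special cases can be checked numerically. The sketch below is an illustration only: it evaluates, for a hypothetical list of cluster sizes, the grouped first term of comparison (A) and the difference of comparison (B), with each squared deviation replaced by an assumed variance law for the cluster mean (constant, inversely proportional to size, inversely proportional to the square of size). The function names, the value of \(\gamma\) and the sizes are assumptions.

```python
def first_term_A(sizes, var_given_size, n):
    """-(1/(nN)) * sum over clusters of (M_i/M_bar - 1) * V(ybar_i | M_i)."""
    N = len(sizes)
    M_bar = sum(sizes) / N
    return -sum((M / M_bar - 1) * var_given_size(M) for M in sizes) / (n * N)


def difference_B(sizes, var_given_size, n):
    """(1/(nN)) * sum of (M_i/M_bar) * (M_i/M_bar - 1) * V(ybar_i | M_i)."""
    N = len(sizes)
    M_bar = sum(sizes) / N
    return sum((M / M_bar) * (M / M_bar - 1) * var_given_size(M)
               for M in sizes) / (n * N)


gamma = 1.0  # hypothetical constant
cases = {
    "I": lambda M: gamma,             # variance independent of size
    "II": lambda M: gamma / M,        # variance inversely proportional to size
    "III": lambda M: gamma / M ** 2,  # variance falling off faster than 1/size
}
sizes = [1, 2, 2, 3, 4, 4, 5, 8]      # hypothetical cluster sizes
n = 4
for name, law in cases.items():
    print(name, first_term_A(sizes, law, n), difference_B(sizes, law, n))
```

A positive value favours selection with probability proportional to size; a negative value in comparison (B) favours the ratio estimate under equal probability.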


Example 6.5 


Table 6.12 gives the number of villages and the area under 
wheat in each of 89 administrative areas* in Hapur Subdivision 
of Meerut District (India), and Table 6.13 gives the analysis of 
variance on a village basis. It is required to estimate the total 
area under wheat in the subdivision using an administrative 
circle as the unit of sampling. We shall assume that a sample of 
20 circles is to be selected. Calculate the sampling variance of 
the estimate of the total area under wheat for each of the 
following procedures of sampling and estimation: 


(а) equal probability, mean of the cluster means estimate, 
(b) equal probability, mean of the cluster totals estimate, 
(c) equal probability, ratio estimate, 


(d) probability proportional to the size of the circle, mean of
the cluster means estimate.

Also calculate the variance of an equivalent sample with the
village as the unit of sampling and compare the relative efficiency
of the various methods.

(a) Equal Probability, Mean of the Cluster Means Estimate

\[ \text{M.S.E.}(M_0\,\bar{y}_{n.}) = M_0^2\left\{\frac{N-n}{Nn}\,\frac{1}{N-1}\sum_{i=1}^{N} (\bar{y}_{i.} - \bar{y}_{N.})^2 + (\bar{y}_{N.} - \bar{y}_{..})^2\right\} \]

\[ = (299)^2\left\{\frac{69}{89 \times 20} \cdot \frac{1}{88}\,(6499209) + (387.35 - 328.02)^2\right\} \]

\[ = 89401\,(2862.9 + 3511.75) \]

\[ = 5699 \times 10^5 \text{ acres}^2 \]

* These areas are known as Patwari Circles in the local terminology.
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TABLE 6.12 


Number of Villages and the Area under Wheat in the 
Administrative Circles of Hapur 


Circle Number of Ateaunder | Circe Number of Аш" 

Мо. Villages (Acres) No. Villages (Acres) 

(м) (Mj) © (м) (Mii) 
1 6 1562 29 2 583 
2 5 1003 30 4 1150 
3 4 1691 31 3 670 
t 5 i 32 2 499 
5 4 458 33 4 714 
6 2 136 34 4 1081 
7 4 1224 35 1 3m 
8 2 996 36 7 2675 
9 5 475 37 3 ses 
10 1 34 38 2 1412 
11 3 1027 39 2 748 
12 4 1393 40 5 00 
13 з 692 41 2 542 
14 1 524 42 4 2050 
15 1 602 43 6 2330 
16 3 1522 44 ! 20) 
17 4 2087 45 2 52 
18 8 2474 46 2 e 
x 19 2 461 47 5 sie 
20 d 846 48 1 1d 
21 3 1036 49 ou 
22 4 948 50 20 a 
23 4 1412 51 У л? 
24 3 2 52 9 3622 
25 5 2111 53 2; p 
Ей ; e 12 2 1584 
27 3 814 БЫ З a 
28 1 319 56 5 e 
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TABLE 6.12—Contd. 


Circle Number of лер Circle ^ Number of == 
Мо. Villages (Acres) No. Villages (Acres) 
(i) (м) (Mic) @) (м) (Mae) 
57 3 622 75 4 669 
55 2 591 76 1 1187 
59 5 273 77 2 852 
Gu 2 781 78 1 51 
61 2 1101 79 1 1265 
62 2 799 80 8 1423 
53 2 601 81 2 794 
5 Ё 928 82 1 1604 
s ы ШП 83 3 1621 

s 1 1208 84 2 1764. 
67 5 1633 85 6 2668 
65 4 902 86 1 1076 
69 3 1286 87 1 348 
10 5 1299 88 4 4 
u 7 1947 89 4 Tus 
ue 3 741 

29, 2 574 Total 299 98078 
th 7 2554 

TABLE 6.13

Analysis of Variance of Areas under Wheat in Villages
in Hapur Subdivision (Acres)²

Source of Variation                 Degrees of Freedom   Sum of Squares   Mean Square

Between circles                             88              10924581         124143
Within circles between villages            210               9588011          45657
Total population                           298              20512592          68834

It will be noticed that the bias exceeds the standard error proper
owing to the large variation in \(M_i\). The method must, therefore,
be rejected from further consideration.


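(b) Equal Probability, Mean of the Cluster Totals Estimate

\[ V(M_0\,\bar{y}'_{n.}) = M_0^2\,\frac{N-n}{Nn}\,\frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i\,\bar{y}_{i.}}{\bar{M}} - \bar{y}_{..}\right)^2 \]

\[ = \frac{89 \times 69}{20} \times 513613 \]

\[ = 1577 \times 10^5 \text{ acres}^2 \]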


| 
\ 
| (с) Equal Probability, Ratio Estimate 


1 69 x 89 
= . 3 
| ij 30204 


= 1050х 105 acres? 


| n ч 
k (ea - и (1+ CrM) 


| 


3 2 os) 105 
1050 (1 + 39 * (з-360)#/ 7 


1107 x 105 acres? 


| 




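(d) Probability Proportional to the Size of the Circle, Mean of the
Cluster Means Estimate

\[ V(M_0\,\bar{z}_{n.}) = M_0^2 \cdot \frac{1}{n}\sum_{i=1}^{N} \frac{M_i}{M_0}\,(\bar{y}_{i.} - \bar{y}_{..})^2 \]

\[ = (299)^2 \times \frac{1}{20} \times (36537) \]

\[ = 1633 \times 10^5 \text{ acres}^2 \]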


(е) Village as Unit of Sampling, Equal Probability, Mean per 
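Village Estimate

\[ V(M_0\,\bar{y}_{n\bar{M}}) = M_0^2\,\frac{N\bar{M} - n\bar{M}}{N\bar{M} \cdot n\bar{M}}\,\frac{1}{M_0 - 1}\sum_{i=1}^{N}\sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{..})^2 \]

\[ = \frac{299 \times 69}{20}\,(68834) \]

\[ = 710 \times 10^5 \text{ acres}^2 \]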


The relative efficiencies of the different methods are then as
follows:

      Sampling   Sampling Method                     Method of Estimation      Relative
      Unit                                                                     Efficiency

(a)   Circle     Equal probability                   Mean of cluster means        12
(b)   Circle     Equal probability                   Mean of cluster totals       45
(c)   Circle     Equal probability                   Ratio                        64
(d)   Circle     Probability proportional to size    Mean of cluster means        43
(e)   Village    Equal probability                   Mean per village            100


The very low efficiency of method (a) is partly due to the
presence of serious bias in the estimate. Of the other methods 
with the circle as the unit of sampling, the ratio method is seen 
to be the most efficient. The explanation is provided by Table 
6.14 showing the two-way classification of circles by the area 
under wheat and by size in terms of the number of villages. 
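
The arithmetic of Example 6.5 can be re-checked with a few lines of code. The sketch below is a verification aid under stated assumptions, not part of the original example: it takes the intermediate sums of squares as they are printed in the worked calculations, uses the printed value for the ratio estimate (c) without recomputing it, and uses variable names invented for the purpose.

```python
N, n, M0 = 89, 20, 299        # circles, circles sampled, villages in all
fpc = (N - n) / (N * n)       # (N - n)/(Nn) for equal-probability sampling

# (a) mean of cluster means: variance part plus squared bias (3511.75 as printed)
var_a = M0 ** 2 * (fpc * 6499209 / (N - 1) + 3511.75)
# (b) mean of cluster totals: N^2 (N - n)/(Nn) times the printed mean square 513613
var_b = N ** 2 * fpc * 513613
# (c) ratio estimate: second-approximation value taken as printed
var_c = 1107e5
# (d) probability proportional to size, mean of cluster means
var_d = M0 ** 2 * 36537 / n
# (e) village as the unit, equivalent sample of n * M0 / N villages
var_e = M0 * (N - n) / n * 68834

variances = [var_a, var_b, var_c, var_d, var_e]
efficiency = [100 * var_e / v for v in variances]
for label, v, e in zip("abcde", variances, efficiency):
    print(label, round(v / 1e5), round(e))
```

The printed variances are in units of 10^5 acres², and the efficiencies are expressed relative to method (e), the village as the unit of sampling.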




TABLE 6.14 


Frequency Distribution of Circles by Area under Wheat 
and Number of Villages 


3600<3800 1 
<3600 1 
<3400 
<3200 
<3000 
<2800 1 1 
<2600 1 1 
<2400 
<2200 2-2 
<2000 
<1800 1 1 1 1 1 
<1600 3 1 
<1400 
<1200 2 1 
<1000 
< 800 2 
< 600 1 

200< 400 
0< 200 2 


Area under Wheat in a Circle 
ә 


Noc шә о ~ 


Mo ~ 


5 6 1 8 9 10 


1 2 3 4 

Number of Villages in a Circle 
The relationship between the two, i.e., \(M_i\,\bar{y}_{i.}\) and \(M_i\), is seen to be
linear, the value of the coefficient of correlation being .64. The
variability among areas under wheat in circles of the same size
is seen to be rather independent of the size, and hence the
relative superiority of the ratio method (with equal probability)
over the simple arithmetic mean estimate when the clusters are
selected with probability proportional to size. The
village as a unit of sampling is seen to be far superior to the
circle, and since the village is also known to be administratively
convenient, it is preferred to the use of the circle as the unit of
sampling in most agricultural surveys in India.


а 


CHAPTER VII
SUB-SAMPLING 


7.1 Introduction 


So far we have considered only sampling procedures in which
all the elements of the selected clusters are enumerated. We
also saw that the larger the cluster the less efficient it will usually
be relative to the element as the unit of sampling. It is therefore
logical to expect that, for a given number of elements, greater
precision will be attained by distributing them over a large
number of clusters than by taking a small number of clusters
and sampling a large number of elements from each of them or
completely enumerating them. The procedure of first selecting
clusters and then choosing a specified number of elements from
each selected cluster is known as sub-sampling. It is also known
as two-stage sampling. The clusters which form the units of
sampling at the first stage are called the first-stage units and
the elements or groups of elements within clusters which form the
units of sampling at the second stage are called sub-units or
second-stage units. The procedure is easily generalized to three-stage
or multi-stage sampling. As an example of three-stage
sampling, we may refer to crop surveys in which villages are the
first-stage units, fields within villages are the second-stage units
and plots within fields the third-stage units of sampling, the
correlation between yield of adjoining portions of the same field
rendering it unnecessary and also uneconomical to harvest the
whole field.


7.2 Two-Stage Sampling, Equal First-Stage Units: Estimate of
the Population Mean

We shall first consider the case of equal clusters and assume
that the population is composed of \(NM\) elements grouped into
\(N\) first-stage units of \(M\) second-stage units each. Let \(n\) denote
the number of first-stage units in the sample and \(m\) the number
of second-stage units to be drawn from each selected first-stage
unit. Further, we shall suppose that the units at each stage are
selected with equal probability.

Now let, as previously,

\(y_{ij}\) = the value of the j-th second-stage unit in the i-th first-stage
unit \((j = 1, 2, \ldots, M;\ i = 1, 2, \ldots, N)\),

\[ \bar{y}_{i.} = \frac{1}{M}\sum_{j=1}^{M} y_{ij} \]

= the mean per second-stage unit in the i-th first-stage unit
in the population \((i = 1, 2, \ldots, N)\)

and

\[ \bar{y}_{..} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} = \frac{1}{N}\sum_{i=1}^{N} \bar{y}_{i.} \]

= the mean per second-stage unit in the population.

Further, let

\[ \bar{y}_{i(m)} = \frac{1}{m}\sum_{j=1}^{m} y_{ij} \]

= the mean per second-stage unit of the i-th first-stage unit
in the sample

and

\[ \bar{y}_{nm} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij} = \frac{1}{n}\sum_{i=1}^{n} \bar{y}_{i(m)} \]

= the mean per second-stage unit in the sample.

Then it can be shown that of all the linear estimates of the type
\(\sum_{i=1}^{n}\sum_{j=1}^{m} d_{ij}\,y_{ij}\), where the \(d_{ij}\) are constants independent of the
selected sample, the sample mean \(\bar{y}_{nm}\) provides the best unbiased
estimate of the population mean \(\bar{y}_{..}\). In the next section we [...]


7.3 Two-Stage Sampling, Equal First-Stage Units: Expected
Value and Variance of the Sample Mean


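Since the sample is selected in two stages, the expected value
is also appropriately worked out in two stages, first over all
possible samples of \(m\) from each of \(n\) fixed first-stage units and
then over all possible samples of \(n\). Thus, we have

\[ E(\bar{y}_{nm}) = E\left[\frac{1}{n}\sum_{i=1}^{n} E(\bar{y}_{i(m)} \mid i)\right] = E\left[\frac{1}{n}\sum_{i=1}^{n} \bar{y}_{i.}\right] = \frac{1}{N}\sum_{i=1}^{N} \bar{y}_{i.} = \bar{y}_{N.} \]

Since the first-stage units are equal,

\[ \bar{y}_{N.} = \bar{y}_{..} \]

Hence

\[ E(\bar{y}_{nm}) = \bar{y}_{..} \tag{1} \]

thus showing that the simple mean of all elements in the sample
gives an unbiased estimate of the population mean.

By definition, the variance of the sample mean is given by

\[ V(\bar{y}_{nm}) = E\{\bar{y}_{nm} - E(\bar{y}_{nm})\}^2 = E(\bar{y}_{nm} - \bar{y}_{..})^2 \tag{2} \]

This can be written as

\[ V(\bar{y}_{nm}) = E\{(\bar{y}_{nm} - \bar{y}_{n.}) + (\bar{y}_{n.} - \bar{y}_{..})\}^2 \]

\[ = E(\bar{y}_{nm} - \bar{y}_{n.})^2 + E(\bar{y}_{n.} - \bar{y}_{..})^2 + 2E\{(\bar{y}_{nm} - \bar{y}_{n.})(\bar{y}_{n.} - \bar{y}_{..})\} \tag{3} \]

where \(\bar{y}_{n.}\) denotes the simple mean of \(n\) first-stage unit means,
given by

\[ \bar{y}_{n.} = \frac{1}{n}\sum_{i=1}^{n} \bar{y}_{i.} \]

Now

\[ \bar{y}_{nm} - \bar{y}_{n.} = \frac{1}{n}\sum_{i=1}^{n} (\bar{y}_{i(m)} - \bar{y}_{i.}) \]

so that

\[ E(\bar{y}_{nm} - \bar{y}_{n.})^2 = \frac{1}{n^2}\,E\left[\sum_{i=1}^{n} (\bar{y}_{i(m)} - \bar{y}_{i.})\right]^2 \]

\[ = \frac{1}{n^2}\,E\left[\sum_{i=1}^{n} (\bar{y}_{i(m)} - \bar{y}_{i.})^2 + \sum_{i \neq i'} (\bar{y}_{i(m)} - \bar{y}_{i.})(\bar{y}_{i'(m)} - \bar{y}_{i'.})\right] \]

\[ = \frac{1}{n^2}\,E\left[\sum_{i=1}^{n} E\{(\bar{y}_{i(m)} - \bar{y}_{i.})^2 \mid i\} + \sum_{i \neq i'} E\{(\bar{y}_{i(m)} - \bar{y}_{i.})(\bar{y}_{i'(m)} - \bar{y}_{i'.}) \mid i, i'\}\right] \tag{4} \]

The value of the second term under the summation sign is clearly
zero since sub-samples are drawn independently from the i-th
and i'-th first-stage units and the value of the first term under
the summation sign is given by the well-known result

\[ E\{(\bar{y}_{i(m)} - \bar{y}_{i.})^2 \mid i\} = \left(\frac{1}{m} - \frac{1}{M}\right) S_i^2 \tag{5} \]

where

\[ S_i^2 = \frac{1}{M-1}\sum_{j=1}^{M} (y_{ij} - \bar{y}_{i.})^2 \]

whence we obtain

\[ E(\bar{y}_{nm} - \bar{y}_{n.})^2 = \frac{1}{n^2}\,E\left[\sum_{i=1}^{n}\left(\frac{1}{m} - \frac{1}{M}\right) S_i^2\right] = \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 \tag{6} \]

where

\[ \bar{S}_w^2 = \frac{1}{N}\sum_{i=1}^{N} S_i^2 \tag{7} \]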

Also, from (5) and (6) of Chapter VI, 


EG Ty = (; – у) 5 (8) 
where 
N 
PEU 
5: = = cT 


The value of the last term in (3) is clearly zero. For, 


E {Gum — Fn) Gu. — Yu.) 


= Е |o io x 1) Eaa) a 


= Е [(Pa. — Ји.) x 0] 
es 0 (9) 


Substituting from (6), (8) and (9) in (3), we get on interchanging
the order of the first two terms

$$V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 \tag{10}$$

The variance is seen to be made up of two components. If
the selected first-stage units had been completely enumerated, in
other words, if m = M, the variance of the sample mean would
be obviously given by the first component only. Actually, each
selected first-stage unit is only partially enumerated by means of
a sample of second-stage units drawn from it. The second term
in (10) thus represents the contribution to the sampling variance
arising from sub-sampling the selected first-stage units. In fact,
setting m = M in (10) we have equation (5) of Chapter VI.

When n = N, or in other words, every first-stage unit in the
population is sampled, we are left with the second component
to represent the variance of the sample mean. This case corresponds
to stratified sampling with first-stage units as strata and
a simple random sample of m drawn from each of several strata.
We can thus look upon a sub-sampling design as a case of
incomplete stratification as it were, the first component representing
the additional contribution to the variance of a stratified
sample arising from incomplete stratification.


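When N is large relative to n, so that (N - n)/N can be taken
as unity, we have

$$V(\bar{y}_{nm}) = \frac{S_b^2}{n} + \left(\frac{1}{m} - \frac{1}{M}\right)\frac{\bar{S}_w^2}{n} \tag{11}$$

When M is large relative to m, so that (M - m)/M can be
taken as unity, we have

$$V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{\bar{S}_w^2}{nm} \tag{12}$$

And when finite multipliers at both stages can be taken as
unity, we are left with the simple expression

$$V(\bar{y}_{nm}) = \frac{S_b^2}{n} + \frac{\bar{S}_w^2}{nm} \tag{13}$$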
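
As an illustration of (10) and its limiting forms (11) to (13), the following is a minimal sketch in Python. The function name `two_stage_variance` and the convention of passing `None` for N or M to drop a finite multiplier are assumptions made for this example, not notation from the text.

```python
def two_stage_variance(S_b2, S_w2, n, m, N=None, M=None):
    """Variance of the two-stage sample mean, equation (10).

    S_b2 : true mean square between first-stage unit means
    S_w2 : mean of the true mean squares within first-stage units
    N, M : population sizes; None drops the corresponding finite
           multiplier, giving the approximations (11), (12) or (13).
    """
    inv_N = 1.0 / N if N else 0.0
    inv_M = 1.0 / M if M else 0.0
    between = (1.0 / n - inv_N) * S_b2          # first component
    within = (1.0 / m - inv_M) * S_w2 / n       # sub-sampling component
    return between + within


# Complete enumeration of selected units (m = M) leaves only the
# first component; sampling every unit (n = N) leaves only the second.
only_first = two_stage_variance(10.0, 5.0, n=4, m=6, N=20, M=6)
only_second = two_stage_variance(10.0, 5.0, n=20, m=3, N=20, M=6)
```

With both `N` and `M` omitted the function reduces to the simple expression (13).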


7.4 Two-Stage Sampling, Equal First-Stage Units: Estimation of 
the Variance of the Sample Mean 


The calculation of the variance of the sample mean in two-stage
sampling involves the estimation of $S_b^2$ and $\bar{S}_w^2$. The simplest
way of estimating these is to define the corresponding quantities
for the sample and obtain their expected values.


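Let $s_b^2$ denote the mean square between first-stage unit means
in the sample defined by

$$s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_{im} - \bar{y}_{nm})^2 \tag{14}$$

and $s_i^2$ denote the mean square between second-stage units drawn
from the i-th first-stage unit defined by

$$s_i^2 = \frac{1}{m-1}\sum_{j=1}^{m}(y_{ij} - \bar{y}_{im})^2 \tag{15}$$

Equation (14) can be rewritten as

$$(n-1)\,s_b^2 = \sum_{i=1}^{n}\bar{y}_{im}^2 - n\,\bar{y}_{nm}^2$$

whence

$$(n-1)\,E(s_b^2) = E\left(\sum_{i=1}^{n}\bar{y}_{im}^2\right) - n\,E(\bar{y}_{nm}^2) \tag{16}$$

Now, to evaluate the first term in (16), we write

$$E\left(\sum_{i=1}^{n}\bar{y}_{im}^2\right) = E\left[\sum_{i=1}^{n}E(\bar{y}_{im}^2\mid i)\right] = E\left[\sum_{i=1}^{n}\left\{\bar{y}_{i.}^2 + \left(\frac{1}{m} - \frac{1}{M}\right)S_i^2\right\}\right] = \frac{n}{N}\sum_{i=1}^{N}\bar{y}_{i.}^2 + n\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 \tag{17}$$

The value of the second term in (16) can be directly obtained from
(10). For, by definition,

$$V(\bar{y}_{nm}) = E(\bar{y}_{nm}^2) - \bar{y}_{..}^2$$

whence

$$E(\bar{y}_{nm}^2) = \bar{y}_{..}^2 + \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 \tag{18}$$

Substituting from (17) and (18) in (16), and recalling that

$$(N-1)\,S_b^2 = \sum_{i=1}^{N}\bar{y}_{i.}^2 - N\,\bar{y}_{..}^2$$

we obtain

$$E(s_b^2) = S_b^2 + \left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 \tag{19}$$

From (36) of Chapter II, we know that for fixed i,

$$E(s_i^2) = S_i^2$$

Also, for varying i,

$$E\left(\frac{1}{n}\sum_{i=1}^{n}S_i^2\right) = \frac{1}{N}\sum_{i=1}^{N}S_i^2 = \bar{S}_w^2$$

whence

$$E(\bar{s}_w^2) = E\left(\frac{1}{n}\sum_{i=1}^{n}s_i^2\right) = \bar{S}_w^2 \tag{20}$$

We thus have

$$\text{Est. } \bar{S}_w^2 = \bar{s}_w^2 = \frac{\sum_{i=1}^{n}(m-1)\,s_i^2}{n(m-1)} \tag{21}$$

= mean square within first-stage units in the analysis
of variance of the sample

and

$$\text{Est. } S_b^2 = s_b^2 - \left(\frac{1}{m} - \frac{1}{M}\right)\bar{s}_w^2 \tag{22}$$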

When m = M, the second term in (22) vanishes and we are
left with the known result appropriate for one-stage sampling
without sub-sampling. We note also that in two-stage sampling
the estimate of $S_b^2$ is less than the corresponding mean square
between the first-stage unit means in the sample, as one would
expect, since $s_b^2$ is based on estimates of first-stage unit means
and not the true values and is therefore subject to an additional
component of variation.


Substituting from (21) and (22) in (10), we have

$$\text{Est. } V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 + \frac{1}{N}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{s}_w^2 \tag{23}$$

When (N - n)/N can be taken as unity, (21) and (22) will still
hold, giving on substitution in (11)

$$\text{Est. } V(\bar{y}_{nm}) = \frac{s_b^2}{n} = \frac{\text{Mean square between first-stage units in the analysis of variance for the sample}}{nm} \tag{24}$$

When (M - m)/M can be taken as unity, we have

$$\text{Est. } S_b^2 = s_b^2 - \frac{\bar{s}_w^2}{m} \tag{25}$$

and

$$\text{Est. } V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 + \frac{1}{N}\,\frac{\bar{s}_w^2}{m} \tag{26}$$

When (N - n)/N and (M - m)/M can each be taken as unity,
(25) holds good, giving on substitution in (13), for an estimate
of $V(\bar{y}_{nm})$, an expression identical with (24), namely,

$$\text{Est. } V(\bar{y}_{nm}) = \frac{s_b^2}{n} = \frac{\text{Mean square between first-stage units in the analysis of variance of the sample}}{nm} \tag{27}$$
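
The estimation steps (14), (21), (22) and (23) can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_two_stage`, the nested-list data layout and the toy data are assumptions for this example, and `None` for N or M is taken to mean that the corresponding finite multiplier is ignored.

```python
from statistics import mean, variance


def estimate_two_stage(sample, N=None, M=None):
    """Estimate S_b^2, S_w^2 and V(sample mean) from a two-stage sample.

    sample : list of n lists, each holding the m observations taken
             from one selected first-stage unit (equal m assumed).
    """
    n = len(sample)
    m = len(sample[0])
    inv_N = 1.0 / N if N else 0.0
    inv_M = 1.0 / M if M else 0.0

    unit_means = [mean(unit) for unit in sample]
    s_b2 = variance(unit_means)                     # equation (14)
    s_w2 = mean(variance(unit) for unit in sample)  # equation (21)

    est_S_b2 = s_b2 - (1.0 / m - inv_M) * s_w2      # equation (22)
    est_var = (1.0 / n - inv_N) * s_b2 + inv_N * (1.0 / m - inv_M) * s_w2  # (23)
    return {"s_b2": s_b2, "s_w2": s_w2,
            "est_S_b2": est_S_b2, "est_S_w2": s_w2, "est_var": est_var}


# Hypothetical data: 3 first-stage units, 2 second-stage units each.
toy = [[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]
large_population = estimate_two_stage(toy)            # reduces to (27)
finite_population = estimate_two_stage(toy, N=6, M=4)
```

When `N` is omitted the estimate of the variance collapses to $s_b^2/n$, in line with (24) and (27).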


7.5 Distribution of Sample between Two Stages: Equal First- 
Stage Units 


The expression (10) for the variance of the sample mean in
a two-stage sampling design shows that the precision of a two-stage
sample, apart from the values of $S_b^2$ and $\bar{S}_w^2$, depends upon
the distribution of the sample between the two stages, or in other
words, on n and m individually. The cost of surveying a two-stage
sample will likewise depend upon the values of n and m.
In this section we shall consider the problem of choosing n and
m so that the variance of the sample mean is minimized for given
cost. Alternatively, we can choose n and m so as to provide
an estimate of the desired precision for minimum cost.


We shall first consider the simplest case in which the cost of
the survey is proportional to the size of the sample, so that

$$C = cnm \tag{28}$$

where C = the total cost of the survey and c is a constant. Let
the total cost of the survey be fixed at, say, $C = C_0$. Then
from (28), we have

$$m = \frac{C_0}{cn} \tag{29}$$

Substituting from (29) in the expression for the variance of the
sample mean given by (10), we obtain

$$V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \left(\frac{c}{C_0} - \frac{1}{nM}\right)\bar{S}_w^2 = \frac{1}{n}\left(S_b^2 - \frac{\bar{S}_w^2}{M}\right) - \frac{S_b^2}{N} + \frac{c\,\bar{S}_w^2}{C_0} \tag{30}$$

which is a monotonic decreasing function of n if $S_b^2 - \bar{S}_w^2/M$
is positive, reaching its minimum when n assumes the maximum
value, namely,

$$\hat{n} = \frac{C_0}{c} \tag{31}$$

Equation (29) then gives

$$\hat{m} = 1 \tag{32}$$


If $S_b^2 - \bar{S}_w^2/M < 0$, which for large N is equivalent to stating
that the intra-class correlation is negative, the right-hand side of
equation (30) becomes a monotonic increasing function of n,
reaching its minimum when n is minimum, given by

$$\hat{n} = \frac{C_0}{cM}$$

In other words, there is no sub-sampling.

The alternative approach of estimating the population mean
with the desired precision for minimum cost leads to the same
solution for m. For, let $V_0$ be the value of the variance with
which it is desired to estimate the population mean, so that

$$V_0 = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 \tag{33}$$

Solving for n, we get

$$n = \frac{S_b^2 - \dfrac{\bar{S}_w^2}{M} + \dfrac{\bar{S}_w^2}{m}}{V_0 + \dfrac{S_b^2}{N}} \tag{34}$$

so that

$$C = cm \times \frac{S_b^2 - \dfrac{\bar{S}_w^2}{M} + \dfrac{\bar{S}_w^2}{m}}{V_0 + \dfrac{S_b^2}{N}} \tag{35}$$

Clearly, for $S_b^2 - \bar{S}_w^2/M > 0$, C attains its minimum when
m assumes the smallest integral value, namely 1; and for
$S_b^2 - \bar{S}_w^2/M < 0$, the minimum is attained when m = M.


It should be noted that the optimum distribution is independent
of the variability of the character under study. This suggests
the advisability of enlarging the scope of the questionnaire to
carry several items whenever the field cost is proportional to the
number of second-stage units or interviews, the sub-sampling
design with one sub-unit from each selected first-stage unit being
the most efficient in this case, provided, of course, $S_b^2 - \bar{S}_w^2/M$
is positive for each item, and $C_0/c < N$.


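We shall next consider a more general case when the cost of
the survey is represented by

$$C = c_1 n + c_2 nm \tag{36}$$

where $c_1$ and $c_2$ are positive constants. From (36) and (10), we
obtain

$$C\left[V(\bar{y}_{nm}) + \frac{S_b^2}{N}\right] = (c_1 + c_2 m)\left[\left(S_b^2 - \frac{\bar{S}_w^2}{M}\right) + \frac{\bar{S}_w^2}{m}\right]$$

$$= c_1\left(S_b^2 - \frac{\bar{S}_w^2}{M}\right) + c_2\bar{S}_w^2 + c_2 m\left(S_b^2 - \frac{\bar{S}_w^2}{M}\right) + \frac{c_1\bar{S}_w^2}{m} \tag{37}$$

Clearly, the minimum value of (37) will provide the optimum
allocation for both the cases, when either C or V is fixed in
advance and V or C minimized.

For $S_b^2 - \bar{S}_w^2/M > 0$, (37) can be put in the form

$$C\left[V(\bar{y}_{nm}) + \frac{S_b^2}{N}\right] = \left[\sqrt{c_1\left(S_b^2 - \frac{\bar{S}_w^2}{M}\right)} + \sqrt{c_2\bar{S}_w^2}\right]^2 + \left[\sqrt{c_2 m\left(S_b^2 - \frac{\bar{S}_w^2}{M}\right)} - \sqrt{\frac{c_1\bar{S}_w^2}{m}}\right]^2$$

and is minimum when the square term in m is equated to zero,
or in other words, when $\hat{m}$ is given by the nearest integer to

$$\sqrt{\frac{c_1}{c_2}\cdot\frac{\bar{S}_w^2}{S_b^2 - \bar{S}_w^2/M}} \tag{38}$$

or, approximately by

$$\sqrt{\frac{c_1}{c_2}\left(\frac{1}{\rho} - 1\right)} \tag{39}$$

where $\rho$ is the intra-class correlation within first-stage units.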
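
The optimum sub-sampling rules (38) and (39) translate directly into code. The sketch below is illustrative only; the function names are assumptions for this example, and the numerical call uses the cost constants and variance estimates of Example 7.1 with the finite multiplier 1/M ignored.

```python
import math


def optimum_m(c1, c2, S_b2, S_w2, M=None):
    """Unrounded optimum m from equation (38).

    Raises ValueError when S_b^2 - S_w^2/M is not positive, since
    (38) applies only to the positive case.
    """
    excess = S_b2 - (S_w2 / M if M else 0.0)
    if excess <= 0:
        raise ValueError("equation (38) needs S_b^2 - S_w^2/M > 0")
    return math.sqrt((c1 / c2) * S_w2 / excess)


def optimum_m_from_rho(c1, c2, rho):
    """Approximate optimum m from equation (39)."""
    return math.sqrt((c1 / c2) * (1.0 / rho - 1.0))


# Cost C = 7n + 2nm with Est. S_b^2 = 3475.5 and Est. S_w^2 = 9568.5.
m_hat = round(optimum_m(7, 2, 3475.5, 9568.5))
```

The value is rounded to the nearest integer, as the text prescribes; a larger ratio $c_1/c_2$ or a smaller $\rho$ pushes the optimum m upward.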


For $S_b^2 - \bar{S}_w^2/M < 0$, the expression on the right-hand side
of (37) is minimum when m is the maximum attainable integral
value. If the total cost fixed for the survey, namely $C_0$, exceeds
$c_1 + c_2 M$, we have

$$\hat{m} = M$$

and $\hat{n}$ is the greatest integer in

$$\frac{C_0}{c_1 + c_2 M} \tag{40}$$

If $C_0$ is less than $c_1 + c_2 M$, $\hat{m}$ is the greatest integer in

$$\frac{C_0 - c_1}{c_2}$$

and $\hat{n}$ is 1.


It will be noticed that $\hat{m}$ is now dependent upon the magnitude
of the two cost constants, as also on the intra-class correlation
of the character under study. In general, if $S_b^2 - \bar{S}_w^2/M > 0$,
the optimum value for m will be smaller if: (i) the travel between
first-stage units and other costs which go to make up $c_1$ are
cheaper, (ii) the cost of collecting sub-samples from the selected
first-stage units is larger, and (iii) the intra-class correlation is
large. It follows that the optimum sub-sampling rate will change
from character to character. For a related group of items,
however, for which the value of $\rho$ does not materially vary it may
be possible to hit upon a satisfactory optimum for m without
appreciable loss of efficiency. Sample surveys for the estimation
of acreage under major crops provide evidence on this point.
Table 7.1 gives the values of $S_b^2$, $\bar{S}_w^2$ and $\rho$ for the acreage under
wheat, gram, maize and sugarcane calculated from the records of
complete enumeration during 1936 of all the villages in Hapur
Subdivision of Meerut District (India). The village constituted
the first-stage unit and a grid of 8 fields formed by grouping
consecutive fields within a village was taken as the second-stage
unit. It is seen that $\rho$ is of the same order for three out of
the four crops and it seems feasible to hit upon a satisfactory
value of m without very much departing from the optimum
for the individual crops. When, however, minor crops are also
included in the survey, the value of $\rho$ is found to vary considerably
and it is not possible to design an efficient survey
with a common sub-sampling rate for all items. This difficulty
arises in all general purpose surveys and one solution aimed at
minimizing the loss of efficiency lies in grouping together related
items of the questionnaire and using different sampling designs
for the several groups.

TABLE 7.1

Values of the True Variance between and within Villages
for Area under Different Crops in Hapur Subdivision

(Biswas²/Grid)

Crop         Area under crops      S_b²      S̄_w²      ρ
             in 1000 Biswas*
Wheat             1962             483.8    4991.4    .0884
Gram               552             121.2     700.2    .1476
Maize              556              60.3     698.2    .0795
Sugarcane          526             110.6    1127.7    .0893

* Biswas is a local unit of area.
Example 7.1 


A yield survey on paddy was carried out in West Godavari 
District (India) in 1946-47. Five villages were selected in each 
of the seven strata into which the district is divided, three fields 
were harvested in each village and one plot of 1/100 acre was 
harvested in each field. The data are reproduced in Table 7.2. 
Obtain pooled values of $s_b^2$ and $\bar{s}_w^2$ for the district and the
estimates of $S_b^2$ and $\bar{S}_w^2$. Finite multipliers at the sub-sampling
stage may be ignored.

Calculate the sampling variance of the estimate of the district
mean yield and the percentage standard error.

Assuming that the sample of villages is to be allocated in
proportion to the numbers of villages in the several strata
and that the cost in rupees of the survey is represented by

$$C = 7n + 2nm$$

calculate the values of n and m that may be recommended for
a subsequent survey in order that the district mean yield may be
estimated with standard error of 12 oz. per plot for the minimum
cost.

From the last row of Table 7.2, we obtain

$$\text{Pooled } s_b^2 = \frac{\sum_{t=1}^{k}(n_t - 1)\,s_{bt}^2}{\sum_{t=1}^{k}(n_t - 1)} = \frac{4\times 46655.2}{28} = 6665.0$$

$$\text{Pooled } \bar{s}_w^2 = \frac{\sum_{t=1}^{k} n_t(m - 1)\,s_{wt}^2}{\sum_{t=1}^{k} n_t(m - 1)} = \frac{66979.5}{7} = 9568.5$$

On substituting in (25), we obtain

$$\text{Est. } S_b^2 = 6665.0 - \frac{9568.5}{3} = 3475.5$$

and

$$\text{Est. } \bar{S}_w^2 = 9568.5$$

The variance of the district mean when finite multipliers at the
sub-sampling stage are ignored is given by

$$V(\bar{y}_{nm}) = S_b^2\sum_{t=1}^{k}\left(\frac{1}{n_t} - \frac{1}{N_t}\right)p_t^2 + \frac{\bar{S}_w^2}{m}\sum_{t=1}^{k}\frac{1}{n_t}\,\frac{N_t^2}{N^2}$$

where $p_t = N_t/N$.

TABLE 7.2
Yield Survey on Paddy, West Godavari District (India), 1946-47

Values of Means, Mean Squares between Village Means (s_b²) and
Mean Squares within Villages (s_w²) per Plot Basis

Stratum   No. of villages       Mean         s_b²       s_w²      N_t/N     N_t²/N²
number    Population  Sample    (oz./plot)
          N_t         n_t
1            88         5        347.5      1452.1     2791.5   0.109863   0.012070
2           142         5        297.8      1937.0    27422.5   0.177278   0.031427
3           119         5        201.1      7107.2     1864.0   0.148564   0.022071
4            90         5        438.9      9603.9    11824.0   0.112360   0.012625
5           114         5        282.9     20702.0    13628.2   0.142322   0.020256
6           102         5        301.9      2510.8     2007.5   0.127341   0.016216
7           146         5        186.7      3342.2     7441.8   0.182272   0.033223
Total       801        35                  46655.2    66979.5              0.147888
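Hence

$$\text{Est. } V(\bar{y}_{nm}) = 3475.5\left\{\frac{0.147888}{5} - 0.001248\right\} + \frac{9568.5}{3}\times 0.029578 = 192.80$$

whence

S.E. $(\bar{y}_{nm})$ = 13.89 oz./plot

But

$$\bar{y}_{nm} = \sum_{t=1}^{k} p_t\,\bar{y}_{t(nm)} = 282.90 \text{ oz./plot}$$

where $\bar{y}_{t(nm)}$ is the sample mean of the t-th stratum, whence

% S.E. of the estimate of district mean yield = 4.9

When the number of villages to be sampled is distributed
between the various strata in proportion to $N_t$, the variance is
given by

$$V(\bar{y}_{nm}) = S_b^2\left(\frac{1}{n} - \frac{1}{N}\right) + \frac{\bar{S}_w^2}{nm}$$

whence

$$n = \frac{S_b^2 + \dfrac{\bar{S}_w^2}{m}}{V(\bar{y}_{nm}) + \dfrac{S_b^2}{N}}$$

For $V(\bar{y}_{nm}) = 144$ and $S_b^2/N = 4.3$, this reduces to

$$n = \frac{S_b^2 + \dfrac{\bar{S}_w^2}{m}}{148.3}$$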

Putting m = 1, 2, 3, 4 and 5 successively, we obtain the corresponding
values of n. Substituting these in the equation for cost,
we get the corresponding values of cost. The relevant calculations
are given in Table 7.3.


TABLE 7.3
Calculations of the Optimum Sub-Sampling Rate

  m      S̄_w²/m     S_b² + S̄_w²/m     n = (3)/148.3     Cost = 7n + 2nm
 (1)      (2)           (3)               (4)                (5)
  1     9568.5       13044.0              88                 792
  2     4784.2        8259.7              56                 616
  3     3189.5        6665.0              45                 585
  4     2392.1        5867.6              40                 600
  5     1913.7        5389.2              36                 612

It is seen that the cost is minimum when m = 3 and n = 45.
Alternatively, we can substitute directly in the equation for $\hat{m}$,
namely,

$$\hat{m} = \sqrt{\frac{c_1}{c_2}\cdot\frac{\bar{S}_w^2}{S_b^2}} = \sqrt{\frac{7\times 9568.5}{2\times 3475.5}} = 3.1$$

which to the nearest integer gives

$$\hat{m} = 3$$

The corresponding value for n is obtained by substitution in the
equation for the variance and is found to be 45.
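
The arithmetic of Example 7.1 can be retraced with a short Python sketch. The figures are those of Table 7.2; the variable names and the helper `cost_table` are assumptions made for this illustration. The sketch carries unrounded intermediate values, so $S_b^2/N$ is about 4.34 rather than the rounded 4.3 used in the text, which does not change the rounded values of n.

```python
import math

# Table 7.2: (N_t, n_t, mean yield, s_b^2, s_w^2) for each stratum
strata = [
    (88, 5, 347.5, 1452.1, 2791.5),
    (142, 5, 297.8, 1937.0, 27422.5),
    (119, 5, 201.1, 7107.2, 1864.0),
    (90, 5, 438.9, 9603.9, 11824.0),
    (114, 5, 282.9, 20702.0, 13628.2),
    (102, 5, 301.9, 2510.8, 2007.5),
    (146, 5, 186.7, 3342.2, 7441.8),
]
m = 3                                   # fields harvested per village
N = sum(s[0] for s in strata)           # 801 villages in the district

# Pooled mean squares, weighted by their degrees of freedom
pooled_s_b2 = (sum((s[1] - 1) * s[3] for s in strata)
               / sum(s[1] - 1 for s in strata))
pooled_s_w2 = (sum(s[1] * (m - 1) * s[4] for s in strata)
               / sum(s[1] * (m - 1) for s in strata))

S_w2 = pooled_s_w2                      # Est. S_w^2
S_b2 = pooled_s_b2 - pooled_s_w2 / m    # Est. S_b^2, equation (25)

# Sum over strata of p_t^2 / n_t; note sum of p_t^2 / N_t equals 1/N
sum_p2_over_n = sum((s[0] / N) ** 2 / s[1] for s in strata)
est_var = S_b2 * (sum_p2_over_n - 1.0 / N) + (S_w2 / m) * sum_p2_over_n
std_error = math.sqrt(est_var)
district_mean = sum(s[0] * s[2] for s in strata) / N
pct_se = 100.0 * std_error / district_mean


def cost_table(V0=144.0, c1=7, c2=2, m_values=range(1, 6)):
    """Rows (m, n, cost) as in Table 7.3 for a target variance V0."""
    denom = V0 + S_b2 / N
    rows = []
    for mm in m_values:
        n = round((S_b2 + S_w2 / mm) / denom)
        rows.append((mm, n, c1 * n + c2 * n * mm))
    return rows


best = min(cost_table(), key=lambda row: row[2])   # cheapest (m, n, cost)
```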


7.6 Comparison of Two-Stage with One-Stage Sampling 


One-stage sampling procedures comparable with two-stage 
sampling will involve either 


(i) sampling nm elements in one single stage, or
(ii) sampling of nm/M first-stage units as clusters, without
further sub-sampling.

The variance of the mean of a simple random sample of nm
elements selected by procedure (i) is given by

$$\left(\frac{1}{nm} - \frac{1}{NM}\right)S^2 \tag{41}$$
To examine how this compares with the variance of a two-stage
sample, it is convenient to express the latter in terms of the
intra-class correlation between elements of the first-stage units.
Substituting for $S_b^2$ and $\bar{S}_w^2$ from (20) and (21) of the previous
chapter in (10), we obtain

$$V(\bar{y}_{nm})_{\text{Two-stage}} = \frac{NM-1}{NM}\,S^2\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{N}{M(N-1)}\{1 + (M-1)\rho\} + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)(1-\rho)\right]$$

$$= \frac{NM-1}{NM}\cdot\frac{S^2}{nm}\left[1 - \frac{m(n-1)}{M(N-1)} + \rho\left\{\frac{N-n}{N-1}\cdot\frac{m(M-1)}{M} - \frac{M-m}{M}\right\}\right] \tag{42}$$

When the sub-sampling rate m/M is small, (42) may be approximated by

$$V(\bar{y}_{nm})_{\text{Two-stage}} \approx \frac{S^2}{nm}\left[1 + \rho\left(\frac{N-n}{N-1}\,m - 1\right)\right] \tag{43}$$

Comparing this with (41), we notice that the relative change
in variance using sub-sampling in place of unrestricted sampling
of elements is approximately given by

$$\rho\left(\frac{N-n}{N-1}\,m - 1\right) \tag{44}$$

This means that the relative efficiency of sub-sampling compared
to unrestricted sampling of elements is approximately equal to
that of sampling clusters of size m(N - n)/(N - 1). For n small
compared with N, the relative change in variance is approximated
by $\rho(m - 1)$.
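
A small numerical sketch of (44) may help fix ideas. The function name `relative_change` is an assumption for this illustration, and the call below borrows the intra-class correlation for wheat from Table 7.1 purely as an example value.

```python
def relative_change(rho, m, n=None, N=None):
    """Approximate relative change in variance, equation (44).

    When n or N is omitted, n is taken as small compared with N and
    the simpler approximation rho * (m - 1) is returned.
    """
    if n is None or N is None:
        return rho * (m - 1)
    return rho * ((N - n) / (N - 1) * m - 1)


# With rho = 0.0884 and m = 3 sub-units per first-stage unit, the
# two-stage variance is roughly 18 per cent larger than that of an
# unrestricted sample of the same number of elements.
increase = relative_change(0.0884, 3)
```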


Next, the variance of the mean of an equivalent sample of
nm/M clusters is given by

$$\left(\frac{M}{nm} - \frac{1}{N}\right)S_b^2 \tag{45}$$

This exceeds (10) by

$$\frac{1}{n}\left(\frac{M}{m} - 1\right)\left(S_b^2 - \frac{1}{M}\bar{S}_w^2\right) \tag{46}$$

For N large and $S_b^2 - \bar{S}_w^2/M > 0$, the excess is approximated by

$$\frac{1}{n}\left(\frac{M}{m} - 1\right)\rho S^2 \tag{47}$$

showing that the smaller the sub-sampling rate m/M, the larger
will be the reduction in variance of a two-stage sample over a
cluster sample. When $S_b^2 - \bar{S}_w^2/M < 0$, sub-sampling will lead
to loss of precision.


7.7 Effect of Change in Size of First-Stage Units on the Variance

We have seen in Section 7.6 that the variance of the mean of
a two-stage sample consisting of n first-stage units with m
second-stage units from each can be expressed as

$$V(\bar{y}_{nm}) = \frac{NM-1}{NM}\cdot\frac{S^2}{nm}\left[1 - \frac{m(n-1)}{M(N-1)} + \rho_1\left\{\frac{N-n}{N-1}\cdot\frac{m(M-1)}{M} - \frac{M-m}{M}\right\}\right]$$

where $\rho_1$ represents the intra-class correlation within first-stage
units of size M. We shall suppose now that the first-stage units
are combined to give N/C new first-stage units with CM
second-stage units each. The variance of the mean of a two-stage
sample of size nm will then be given by

$$V'(\bar{y}_{nm}) = \frac{NM-1}{NM}\cdot\frac{S^2}{nm}\left[1 - \frac{m(n-1)}{M(N-C)} + \rho_2\left\{\frac{N-nC}{N-C}\cdot\frac{m(MC-1)}{MC} - \frac{MC-m}{MC}\right\}\right]$$

where $\rho_2$ will now represent the intra-class correlation within
first-stage units of size MC. The difference between the two
variances can be expressed as

$$V(\bar{y}_{nm}) - V'(\bar{y}_{nm}) = \frac{NM-1}{NM}\cdot\frac{S^2}{nm}\left[\frac{m(n-1)(C-1)}{M(N-1)(N-C)} + a_1\rho_1 - a_2\rho_2\right]$$

where

$$a_1 = \frac{N-n}{N-1}\cdot\frac{m(M-1)}{M} - \frac{M-m}{M}, \qquad a_2 = \frac{N-nC}{N-C}\cdot\frac{m(MC-1)}{MC} - \frac{MC-m}{MC}$$

Since

$$a_1 - a_2 = \frac{m(C-1)(n-1)(NM-1)}{M(N-1)(N-C)} \geq 0$$

and

$$\frac{m(n-1)(C-1)}{M(N-1)(N-C)} \geq 0$$

we conclude that

$$V - V' \geq 0$$

whenever $\rho_1 > \rho_2$, provided both $\rho_1$ and $\rho_2$ are positive. In other
words, a gain in precision is brought about by enlarging first-stage
units whenever the intra-class correlation is positive and
decreases as the size of the first-stage unit increases. It also
follows that the smaller the value of $\rho_2$, the larger is the gain,
so that by choosing for consolidation those first-stage units which
are as different as possible the gain can be increased. Practical
considerations, however, put a limit on the size to which the 
first-stage units can be increased since cost of sub-sampling 
increases with larger and larger areas. This increase in precision 
is to be weighed against the increase in cost. As an example, 
we shall mention that in crop surveys, the variance is decreased 
when an administrative circle comprising a group of villages is 
used in place of a village as the first-stage unit of sampling, but 
practical considerations of cost and administrative convenience 
favour the use of the village (Sukhatme, 1950). If cost were no 
consideration, the enlargement of first-stage units could proceed 
to a point of eliminating the use of first-stage units altogether 
and the second-stage units would be selected independently from 
the whole population. This elegant analysis is due to Hansen 
and Hurwitz (1943). 


7.8 Three-Stage Sampling, Equal First-Stage Units: Sample Mean 
and its Variance 
Let 


N = the number of first-stage units in the population
M = the number of second-stage units in each of N first-stage
units
P = the number of third-stage units in each of NM second-stage
units in the population
and n, m and p the corresponding values in the sample

Further, let

$y_{ijk}$ = the value of the k-th element in the j-th second-stage
unit of the i-th first-stage unit

$$\bar{y}_{ij.} = \frac{1}{P}\sum_{k=1}^{P} y_{ijk}, \qquad \bar{y}_{i..} = \frac{1}{M}\sum_{j=1}^{M}\bar{y}_{ij.}, \qquad \bar{y}_{...} = \frac{1}{N}\sum_{i=1}^{N}\bar{y}_{i..}$$

and

$$\bar{y}_{ij(p)},\ \bar{y}_{i(mp)},\ \bar{y}_{(nmp)}$$

or, simply

$$\bar{y}_{ijp},\ \bar{y}_{imp},\ \bar{y}_{nmp}$$

denote the corresponding values for the sample, the use of parentheses
to distinguish numbers on which the mean is based from
the serial numbers of the selected units being made only where
necessary.


We shall assume that the units at each stage are selected with
equal probability.


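It is easily shown that, as in two-stage sampling, the sample
mean $\bar{y}_{nmp}$ provides an unbiased estimate of the population
mean $\bar{y}_{...}$. For, we have

$$E(\bar{y}_{nmp}) = E\left[\frac{1}{n}\sum_{i=1}^{n}E(\bar{y}_{imp}\mid i)\right] \tag{48}$$

Since $\bar{y}_{imp}$ is the mean of a simple two-stage sample from the i-th
first-stage unit, we have on substituting from (1) in (48),

$$E(\bar{y}_{nmp}) = E\left[\frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i..}\right] = \bar{y}_{...} \tag{49}$$

To obtain the variance of $\bar{y}_{nmp}$, we write

$$V(\bar{y}_{nmp}) = E(\bar{y}_{nmp} - \bar{y}_{...})^2 = E(\bar{y}_{nmp} - \bar{y}_{n..} + \bar{y}_{n..} - \bar{y}_{...})^2$$

$$= E\left[\frac{1}{n^2}\sum_{i=1}^{n}E\{(\bar{y}_{imp} - \bar{y}_{i..})^2\mid i\}\right] + E\left[\frac{1}{n^2}\sum_{i\neq i'}E\{(\bar{y}_{imp} - \bar{y}_{i..})(\bar{y}_{i'mp} - \bar{y}_{i'..})\mid i, i'\}\right]$$

$$\quad + E(\bar{y}_{n..} - \bar{y}_{...})^2 + 2E\left[(\bar{y}_{n..} - \bar{y}_{...})\,\frac{1}{n}\sum_{i=1}^{n}E\{(\bar{y}_{imp} - \bar{y}_{i..})\mid i\}\right] \tag{50}$$

where $\bar{y}_{n..} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i..}$.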


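Since $\bar{y}_{imp}$ is the mean of a simple two-stage sample from the i-th
first-stage unit, we have on substituting from (1) in the second and
fourth terms and from (10) in the first term in (50), and noting that
sampling from the i-th and i'-th first-stage units is carried out
independently,

$$V(\bar{y}_{nmp}) = E\left[\frac{1}{n^2}\sum_{i=1}^{n}\left\{\left(\frac{1}{m} - \frac{1}{M}\right)S_{iw}^2 + \frac{1}{m}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_{ip}^2\right\}\right] + E(\bar{y}_{n..} - \bar{y}_{...})^2$$

$$= \frac{1}{nN}\sum_{i=1}^{N}\left\{\left(\frac{1}{m} - \frac{1}{M}\right)S_{iw}^2 + \frac{1}{m}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_{ip}^2\right\} + \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2$$

where

$$S_{ij}^2 = \frac{1}{P-1}\sum_{k=1}^{P}(y_{ijk} - \bar{y}_{ij.})^2, \qquad \bar{S}_{ip}^2 = \frac{1}{M}\sum_{j=1}^{M}S_{ij}^2, \qquad S_{iw}^2 = \frac{1}{M-1}\sum_{j=1}^{M}(\bar{y}_{ij.} - \bar{y}_{i..})^2$$

and

$$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{y}_{i..} - \bar{y}_{...})^2$$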


On interchanging terms and writing

$$\bar{S}_w^2 = \frac{1}{N}\sum_{i=1}^{N}S_{iw}^2 \quad\text{and}\quad \bar{S}_p^2 = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}S_{ij}^2$$

we have finally

$$V(\bar{y}_{nmp}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \frac{1}{nm}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_p^2 \tag{51}$$

It will be seen that the variance of the sample mean is made
up of three components corresponding to the three stages of
sampling. If each of the nm selected second-stage units were completely
enumerated, in other words if p = P, the variance would
be given by the first two terms appropriate to the two-stage
sampling design. If each of the n first-stage units in the sample
were completely enumerated, in other words if m = M and p = P,
we will be left with the first term only representing the variance
appropriate for one-stage sampling.


When n = N, and from each of N units a two-stage sample
is drawn, we shall be left with the second and third terms to
represent the variance of the mean. The case will correspond
to a stratified two-stage sampling design with the first-stage units
in the population constituting the strata.

Lastly, when finite multipliers are ignored, we get a simple
expression for the variance, namely,

$$V(\bar{y}_{nmp}) = \frac{S_b^2}{n} + \frac{\bar{S}_w^2}{nm} + \frac{\bar{S}_p^2}{nmp} \tag{52}$$
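
The three-component structure of (51) and its simplified form (52) can be checked numerically with a sketch such as the following. The function name and the use of `None` to ignore a finite multiplier are assumptions for this illustration.

```python
def three_stage_variance(S_b2, S_w2, S_p2, n, m, p, N=None, M=None, P=None):
    """Variance of the three-stage sample mean, equation (51).

    Omitting N, M or P drops the corresponding finite multiplier;
    omitting all three gives the simple expression (52).
    """
    def inv(size):
        return 1.0 / size if size else 0.0

    first = (1.0 / n - inv(N)) * S_b2
    second = (1.0 / m - inv(M)) * S_w2 / n
    third = (1.0 / p - inv(P)) * S_p2 / (n * m)
    return first + second + third


# Hypothetical variance components, finite multipliers ignored: (52)
simple = three_stage_variance(12.0, 6.0, 4.0, n=2, m=3, p=2)
# Complete enumeration at the third stage (p = P) removes the third term
no_third = three_stage_variance(12.0, 6.0, 4.0, n=2, m=3, p=5, P=5)
```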


7.9 Three-Stage Sampling, Equal First-Stage Units: Estimation 
from the Sample of the Variance of the Mean 
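As a first step we shall derive the expected values of the
various mean squares in the sample.
Since $E(s_{ij}^2) = S_{ij}^2$, we have from (20),

$$E(\bar{s}_{ip}^2) = \bar{S}_{ip}^2$$

giving us

$$E(\bar{s}_p^2) = \bar{S}_p^2 \tag{53}$$

Again, from (19), we have for a two-stage sample drawn from
a specified first-stage unit,

$$E(s_{iw}^2) = S_{iw}^2 + \left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_{ip}^2$$

whence

$$E(\bar{s}_w^2) = \bar{S}_w^2 + \left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_p^2 \tag{54}$$

Finally

$$(n-1)\,E(s_b^2) = E\left\{\sum_{i=1}^{n}(\bar{y}_{imp} - \bar{y}_{nmp})^2\right\} = E\left\{\sum_{i=1}^{n}\bar{y}_{imp}^2 - n\,\bar{y}_{nmp}^2\right\} = E\left[\sum_{i=1}^{n}E(\bar{y}_{imp}^2\mid i)\right] - n\,E(\bar{y}_{nmp}^2)$$

Now, from (10), we obtain

$$E(\bar{y}_{imp}^2\mid i) = \bar{y}_{i..}^2 + \left(\frac{1}{m} - \frac{1}{M}\right)S_{iw}^2 + \frac{1}{m}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_{ip}^2$$

whence, taking further expectations and using (51), we get

$$(n-1)\,E(s_b^2) = \frac{n}{N}\sum_{i=1}^{N}\bar{y}_{i..}^2 + n\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \frac{n}{m}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_p^2$$

$$\quad - n\left\{\bar{y}_{...}^2 + \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \frac{1}{nm}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_p^2\right\}$$

$$= (n-1)\left\{S_b^2 + \left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \frac{1}{m}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_p^2\right\}$$

or

$$E(s_b^2) = S_b^2 + \left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \frac{1}{m}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{S}_p^2 \tag{55}$$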


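Substituting from (53) and (54) in (55), and from (53) in (54),
we have

$$\text{Est. } \bar{S}_w^2 = \bar{s}_w^2 - \left(\frac{1}{p} - \frac{1}{P}\right)\bar{s}_p^2 \tag{56}$$

and

$$\text{Est. } S_b^2 = s_b^2 - \left(\frac{1}{m} - \frac{1}{M}\right)\bar{s}_w^2 - \frac{1}{M}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{s}_p^2 \tag{57}$$

whence

$$\text{Est. } V(\bar{y}_{nmp}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 + \frac{1}{N}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{s}_w^2 + \frac{1}{NM}\left(\frac{1}{p} - \frac{1}{P}\right)\bar{s}_p^2 \tag{58}$$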

When N is large, the estimates of $\bar{S}_w^2$ and $S_b^2$ are still given by
(56) and (57) but the estimate of the variance of the sample mean
is given by the simple expression

$$\text{Est. } V(\bar{y}_{nmp}) = \frac{s_b^2}{n} = \frac{\text{Mean square between first-stage units in the analysis of variance table for the sample}}{nmp} \tag{59}$$

When the other finite multipliers are also ignored, equations (54)
to (57) reduce to

$$E(\bar{s}_w^2) = \bar{S}_w^2 + \frac{1}{p}\bar{S}_p^2 \tag{60}$$

$$E(s_b^2) = S_b^2 + \frac{1}{m}\bar{S}_w^2 + \frac{1}{mp}\bar{S}_p^2 \tag{61}$$

$$\text{Est. } \bar{S}_w^2 = \bar{s}_w^2 - \frac{1}{p}\bar{s}_p^2 \tag{62}$$

$$\text{Est. } S_b^2 = s_b^2 - \frac{1}{m}\bar{s}_w^2 \tag{63}$$

while the estimate of the variance of the sample mean is given
by (59) as before.


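7.10 Distribution of the Sample among the Three Stages

Here we shall consider the problem of optimum allocation
of the sample among the three stages when the cost of the
survey is represented by

$$C = c_1 n + c_2 nm + c_3 nmp \tag{64}$$

where $c_1$, $c_2$ and $c_3$ are positive constants.

From (64) and (51), deleting the bars over $S_w^2$ and $S_p^2$ for
convenience, we consider the product

$$C\left(V + \frac{S_b^2}{N}\right) = \left[\left(S_b^2 - \frac{1}{M}S_w^2\right) + \frac{1}{m}\left(S_w^2 - \frac{1}{P}S_p^2\right) + \frac{1}{mp}S_p^2\right]\times\{c_1 + c_2 m + c_3 mp\}$$

which can also be written as

$$= \left\{c_1\left(S_b^2 - \frac{1}{M}S_w^2\right) + c_2\left(S_w^2 - \frac{1}{P}S_p^2\right) + c_3 S_p^2\right\}$$

$$\quad + \left\{c_2\left(S_b^2 - \frac{1}{M}S_w^2\right)m + \frac{c_1}{m}\left(S_w^2 - \frac{1}{P}S_p^2\right)\right\}$$

$$\quad + \left\{c_3\left(S_w^2 - \frac{1}{P}S_p^2\right)p + \frac{c_2}{p}S_p^2\right\}$$

$$\quad + \left\{c_3\left(S_b^2 - \frac{1}{M}S_w^2\right)mp + \frac{c_1}{mp}S_p^2\right\} \tag{65}$$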

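The values of m and p which minimize (65) give the optimum
allocation for both cases, when C is minimized for a given $V_0$,
or when V is minimized for a given $C_0$.

When $(S_b^2 - S_w^2/M)$ and $(S_w^2 - S_p^2/P)$ are both positive,
(65) can be rewritten as the sum of four square terms:

$$C\left(V + \frac{S_b^2}{N}\right) = \left[\sqrt{c_1\left(S_b^2 - \frac{1}{M}S_w^2\right)} + \sqrt{c_2\left(S_w^2 - \frac{1}{P}S_p^2\right)} + \sqrt{c_3 S_p^2}\right]^2$$

$$\quad + \left[\sqrt{c_2\left(S_b^2 - \frac{1}{M}S_w^2\right)m} - \sqrt{\frac{c_1}{m}\left(S_w^2 - \frac{1}{P}S_p^2\right)}\right]^2 + \left[\sqrt{c_3\left(S_w^2 - \frac{1}{P}S_p^2\right)p} - \sqrt{\frac{c_2}{p}S_p^2}\right]^2$$

$$\quad + \left[\sqrt{c_3\left(S_b^2 - \frac{1}{M}S_w^2\right)mp} - \sqrt{\frac{c_1}{mp}S_p^2}\right]^2 \tag{66}$$

Clearly, (66) is minimum when the last three square terms are
all zero; then $\hat{m}$ is the nearest integer to

$$\sqrt{\frac{c_1\left(S_w^2 - \frac{1}{P}S_p^2\right)}{c_2\left(S_b^2 - \frac{1}{M}S_w^2\right)}} \tag{67}$$

and $\hat{p}$ is the nearest integer to

$$\sqrt{\frac{c_2 S_p^2}{c_3\left(S_w^2 - \frac{1}{P}S_p^2\right)}} \tag{68}$$

It is interesting to note that the solution for p is independent of
$c_1$ and $S_b^2$. Indeed, p bears the same relationship to $c_2$, $c_3$,
$S_w^2$ and $S_p^2$ as m in (38) bears with $c_1$, $c_2$, $S_b^2$ and $S_w^2$, as one
would, in fact, expect.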
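
A minimal sketch of the rules (67) and (68) follows. The function name and the numerical inputs are hypothetical, chosen only so that the square roots come out exactly; the function covers only the case in which both bracketed differences are positive.

```python
import math


def optimum_three_stage(c1, c2, c3, S_b2, S_w2, S_p2, M=None, P=None):
    """Unrounded optimum (m, p) from equations (67) and (68).

    Omitting M or P ignores the corresponding finite multiplier.
    """
    A = S_b2 - (S_w2 / M if M else 0.0)   # S_b^2 - S_w^2 / M
    B = S_w2 - (S_p2 / P if P else 0.0)   # S_w^2 - S_p^2 / P
    if A <= 0 or B <= 0:
        raise ValueError("(67) and (68) need both differences positive")
    m_hat = math.sqrt(c1 * B / (c2 * A))        # equation (67)
    p_hat = math.sqrt(c2 * S_p2 / (c3 * B))     # equation (68)
    return m_hat, p_hat


# Hypothetical costs and variance components
m_hat, p_hat = optimum_three_stage(16, 4, 1, S_b2=1.0, S_w2=4.0, S_p2=16.0)
```

Consistently with the fourth square term in (66), the product of the two optima equals $\sqrt{c_1 S_p^2 / \{c_3 (S_b^2 - S_w^2/M)\}}$.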


The above solutions presuppose that $S_b^2 - S_w^2/M$ and
$S_w^2 - S_p^2/P$ are both positive, which may not be so.


Case I

Suppose $S_b^2 - S_w^2/M < 0$, but $S_w^2 - S_p^2/P > 0$.

In this case, to minimize (65), m must assume the maximum
attainable value, say $\hat{m}$, which can also be equal to M, and
p is to be such that

$$p\,c_3\left(S_w^2 - \frac{1}{P}S_p^2\right) + \frac{1}{p}c_2 S_p^2 + \hat{m}p\,c_3\left(S_b^2 - \frac{1}{M}S_w^2\right) + \frac{1}{\hat{m}p}c_1 S_p^2 \tag{69}$$

is minimum.

If

$$\left\{c_3\left(S_w^2 - \frac{1}{P}S_p^2\right) + c_3\hat{m}\left(S_b^2 - \frac{1}{M}S_w^2\right)\right\} < 0$$

the expression (69) is minimum when p has the maximum attainable
value.

In case this expression is positive, (69) is minimum when p is
the nearest integer to

$$\sqrt{\frac{\left(c_2 + \dfrac{c_1}{\hat{m}}\right)S_p^2}{c_3\left(S_w^2 - \frac{1}{P}S_p^2\right) + \hat{m}c_3\left(S_b^2 - \frac{1}{M}S_w^2\right)}} \tag{70}$$

Case II

Suppose next that both $S_b^2 - S_w^2/M$ and $S_w^2 - S_p^2/P$ are
negative or zero.

In this case (65) is minimum when p has the maximum attainable
value, say $\hat{p}$, which could also be equal to P, and m is to be
chosen so that

$$c_2\left(S_b^2 - \frac{1}{M}S_w^2\right)m + \frac{c_1}{m}\left(S_w^2 - \frac{1}{P}S_p^2\right) + m\hat{p}\,c_3\left(S_b^2 - \frac{1}{M}S_w^2\right) + \frac{c_1}{m\hat{p}}S_p^2 \tag{71}$$

is minimum. Now (71) can be rewritten as

$$m\left(S_b^2 - \frac{1}{M}S_w^2\right)(c_2 + c_3\hat{p}) + \frac{c_1}{m}\left(S_w^2 - \frac{1}{P}S_p^2 + \frac{1}{\hat{p}}S_p^2\right) \tag{72}$$

which is minimum when m has the maximum attainable value.
Case III

Suppose now that $S_b^2 - S_w^2/M > 0$, but $S_w^2 - S_p^2/P < 0$.

The solution in this case for $\hat{p}\hat{m}$ is given by the nearest
integer to

$$\sqrt{\frac{c_1 S_p^2}{c_3\left(S_b^2 - \frac{1}{M}S_w^2\right)}} \tag{73}$$

with p attaining the maximum and m the minimum possible
values.


7.11 Two-Stage Sampling, Unequal First-Stage Units: Estimate
of the Population Mean


We shall now give the theory appropriate for unequal first-stage 
units. We shall assume selection with equal probability at each 
stage of selection. 


Let

$M_i$ = the number of second-stage units in the i-th first-stage unit
(i = 1, 2, ..., N)

$m_i$ = the number of second-stage units to be selected from the
i-th first-stage unit, if in the sample

$M_0$ = the total number of second-stage units in the population,
i.e., $\sum_{i=1}^{N} M_i$

$m_0$ = the number of second-stage units in the sample,
i.e., $\sum_{i=1}^{n} m_i$ or $\sum_{i=1}^{N} a_i m_i$

where $a_i = 1$, if the i-th first-stage unit is in the sample, and
otherwise zero.

$$\bar{y}_{i(m_i)} = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$$

and

$$\bar{y}_{i.} = \frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij}$$

Further, let

$$\bar{y}_{..} = \frac{\sum_{i=1}^{N} M_i\,\bar{y}_{i.}}{\sum_{i=1}^{N} M_i} = \frac{1}{N\bar{M}}\sum_{i=1}^{N} M_i\,\bar{y}_{i.}, \qquad \bar{M} = \frac{M_0}{N}$$

and

$$\bar{y}_{N.} = \frac{1}{N}\sum_{i=1}^{N}\bar{y}_{i.}$$


Now several estimates of the population mean $\bar{y}_{..}$ can be
formed. The simplest is the simple mean of the first-stage unit
means which we shall denote as $\bar{y}_s$, given by

$$\bar{y}_s = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i(m_i)}$$

where the summation extends over the units in the sample. This
can also be written as

$$\bar{y}_s = \frac{1}{n}\sum_{i=1}^{N} a_i\,\bar{y}_{i(m_i)} \tag{74}$$

where $a_i$ is a random variable such that $a_i = 1$ if the i-th first-stage
unit is in the sample, and otherwise zero.

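A second estimate which we shall denote as $\bar{y}_s'$ is based on
the first-stage unit totals, given by

$$\bar{y}_s' = \frac{1}{n\bar{M}}\sum_{i=1}^{n} M_i\,\bar{y}_{i(m_i)}$$

and can be written as

$$\bar{y}_s' = \frac{1}{n\bar{M}}\sum_{i=1}^{N} a_i\,(M_i\,\bar{y}_{i(m_i)}) \tag{75}$$

where

$M_i\,\bar{y}_{i(m_i)}$ is the estimate of the population total for the i-th
first-stage unit, $\bar{M} = M_0/N$ is the mean number of second-stage
units per first-stage unit

and

$a_i = 1$ if the i-th cluster is in the sample, and zero otherwise.

Yet another estimate which we shall denote as $\bar{y}_s''$ is the ratio
estimate, defined by

$$\bar{y}_s'' = \frac{\sum_{i=1}^{n} M_i\,\bar{y}_{i(m_i)}}{\sum_{i=1}^{n} M_i}$$

and can be written as simply

$$\bar{y}_s'' = \frac{1}{n\bar{u}_n}\sum_{i=1}^{n} u_i\,\bar{y}_{i(m_i)} \tag{76}$$

where

$$u_i = \frac{M_i}{\bar{M}} \quad\text{and}\quad \bar{u}_n = \frac{1}{n}\sum_{i=1}^{n} u_i$$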


More generally, we shall form a ratio estimate of the population
mean.

Let

x be a supplementary variate

$\bar{x}_{..}$ be the population mean, assumed known

R be the population ratio $\bar{y}_{..}/\bar{x}_{..}$

and

$$\hat{R} = \frac{\sum_{i=1}^{n} M_i\,\bar{y}_{i(m_i)}}{\sum_{i=1}^{n} M_i\,\bar{x}_{i(m_i)}}$$

Then the ratio estimate of the population mean $\bar{y}_{..}$ is defined by

$$\bar{y}_R = \bar{x}_{..}\,\hat{R} \tag{77}$$

We shall study the properties of the different estimates in the
next section.

7.12 Two-Stage Sampling, Unequal First-Stage Units: Expected
Values and Variances of the Different Estimates


(a) Estimate $\bar{y}_s$

We write, from (74),

$$E(\bar{y}_s) = E\left[\frac{1}{n}\sum_{i=1}^{N} a_i\,\bar{y}_{i(m_i)}\right] = \frac{1}{n}\sum_{i=1}^{N}E(a_i)\,\bar{y}_{i.} \tag{78}$$

Now, by definition, $a_i = 1$ if the i-th unit is in the sample, and
is otherwise zero. Clearly, $E(a_i)$ is the probability of including
the i-th first-stage unit in the sample. We have seen in Chapter
II that, when units are selected with equal probability, $E(a_i)$ is
n/N. We therefore have from (78),

$$E(\bar{y}_s) = \frac{1}{N}\sum_{i=1}^{N}\bar{y}_{i.} = \bar{y}_{N.} \tag{79}$$

thus showing that $\bar{y}_s$ is a biased estimate of the population mean.
We note that the probability of including a specified unit in the
sample is independent of the unit when the selection probabilities
are equal. In evaluating the expected values it is not, therefore,
necessary to introduce the random variable $a_i$, the use of the
theorem of Section 2a.9 on expectations in Chapter II being
sufficient for the purpose. Thus,

$$E(\bar{y}_s) = E\left[\frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i(m_i)}\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}E(\bar{y}_{i(m_i)}\mid i)\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i.}\right] = \bar{y}_{N.}$$
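
The bias of $\bar{y}_s$ implied by (79) is simply the difference between the mean of the first-stage unit means, $\bar{y}_{N.}$, and the mean per second-stage unit, $\bar{y}_{..}$. A tiny sketch with a hypothetical population of two unequal first-stage units shows the effect; the function names and the data are assumptions for this illustration.

```python
def mean_of_unit_means(units):
    """Unweighted mean of first-stage unit means: E(y_s) by (79)."""
    return sum(sum(u) / len(u) for u in units) / len(units)


def mean_per_element(units):
    """Population mean per second-stage unit."""
    return sum(sum(u) for u in units) / sum(len(u) for u in units)


# Hypothetical population: a small unit with low values and a large
# unit with high values, so that the character is correlated with M_i.
units = [[10.0, 12.0], [20.0, 22.0, 24.0, 26.0, 28.0, 30.0]]
bias = mean_of_unit_means(units) - mean_per_element(units)
```

Here the simple mean of unit means understates the population mean because the larger unit, which holds the higher values, receives the same weight as the smaller one.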


To evaluate the mean square error of Js, we write 
— у 


3, — 9, =F, — Pa + In. Ўн. Yn 7 У. 


whence, squaring both sides and taking expectations term by term, 
We get 


МЕ (ў) = E, — ў. + E On — Iu)? + E On. — 9.) 
4E, — 3.) Фа. — Pw.) 
+ 2E (9, — Pa.) би. — 9.) 
+ 2E Oa. — Ju.) би. — 9.) (80) 


Taking the first term in (80), we have 


E(,— у = Е | oec] 
- 2 n = 3.019 


+ 2 LEG 7 Gas Fe.) J 


ізгі” 


2 
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where 
Hi 
2 (у, =.) 
2 __ іші 
xum M,—1 


The value of the second term in (80) is obviously given by 


EG, —InJ* = (2— Ls? (82) 


where 


N 
ae Л liie 
S} = NEN Ж Oi. — Fn.) 
іші 


The third term in (80) is unaffected by the sampling procedure
and is therefore a constant, being the square of the bias term in
the estimate. The fourth term is clearly zero since the procedure
of selecting a sub-sample from a given first-stage unit is independent
of the procedure of drawing a sample of first-stage units. We
have

$$E(\bar y_{s2} - \bar y_{n.})(\bar y_{n.} - \bar y_{N.}) = E\left[(\bar y_{n.} - \bar y_{N.})\,E\{(\bar y_{s2} - \bar y_{n.}) \mid \text{the selected first-stage units}\}\right] = 0 \tag{83}$$


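The fifth and the sixth terms are obviously zero. We are
therefore left with

$$\begin{aligned}
\text{M.S.E.}(\bar y_{s2}) &= \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{nN}\sum_{i=1}^{N}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 + (\bar y_{N.} - \bar y_{..})^2 \\
&= V(\bar y_{s2}) + (\bar y_{N.} - \bar y_{..})^2
\end{aligned} \tag{84}$$

where

$$V(\bar y_{s2}) = E(\bar y_{s2} - \bar y_{N.})^2 = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{nN}\sum_{i=1}^{N}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{85}$$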
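The decomposition in (84) and (85) can be checked numerically on a very small population. The sketch below is only an illustration: the cluster values, the sub-sample sizes `m` and the first-stage sample size `n` are invented for the purpose. It enumerates every possible two-stage sample, computes the exact expectation and mean square error of the mean of cluster means, and compares them with the formulae.

```python
from itertools import combinations, product
from statistics import mean, variance

# Hypothetical population: N = 4 first-stage units of unequal size.
clusters = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 7.0, 9.0, 11.0], [4.0, 8.0, 6.0]]
m = [2, 1, 2, 2]          # assumed sub-sample size m_i for each unit
n = 2                     # assumed number of first-stage units sampled
N = len(clusters)
M = [len(c) for c in clusters]

unit_means = [mean(c) for c in clusters]            # ybar_i.
y_N = mean(unit_means)                              # mean of unit means
y_pop = sum(sum(c) for c in clusters) / sum(M)      # mean per second-stage unit

# Variance (85) and mean square error (84).
S_b2 = variance(unit_means)                         # divisor N - 1
S_i2 = [variance(c) for c in clusters]              # divisor M_i - 1
V_85 = (1 / n - 1 / N) * S_b2 + sum(
    (1 / m[i] - 1 / M[i]) * S_i2[i] for i in range(N)) / (n * N)
mse_84 = V_85 + (y_N - y_pop) ** 2

def subsample_means(i):
    """All equally likely sub-sample means for first-stage unit i."""
    return [mean(s) for s in combinations(clusters[i], m[i])]

# Exact expectation: average within each first-stage sample, then across them.
first_stage = list(combinations(range(N), n))
exp_enum = mse_enum = 0.0
for units in first_stage:
    outcomes = [mean(ms) for ms in product(*(subsample_means(i) for i in units))]
    exp_enum += mean(outcomes) / len(first_stage)
    mse_enum += mean([(e - y_pop) ** 2 for e in outcomes]) / len(first_stage)

print(f"E(estimate) = {exp_enum:.4f}, mean of unit means = {y_N:.4f}")
print(f"MSE by enumeration = {mse_enum:.4f}, by (84) = {mse_84:.4f}")
```

The averaging is done in two steps because the number of possible sub-samples differs from one first-stage sample to another, so the individual outcomes are not all equally likely.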


The bias component arises because of the failure to give equal
chance to each second-stage unit being included in the sample.
Unless the $M_i$ vary considerably and the character is correlated
with $M_i$, this component may not be serious; but it is important
to collect evidence on this point before adopting $\bar y_{s2}$ as the
estimate of $\bar y_{..}$. The procedure of estimation in the yield
surveys in India provides an instance in point (Sukhatme and Panse,
1951). The number of fields under the crop is often found to vary
considerably from one village to another, thereby pointing to the
need for testing the nature and magnitude of the bias arising from
the use of the simple arithmetic mean of plot yields to estimate the
average yield per plot. Table 7.4 gives a comparison of the yield
estimates in large samples based on $\bar y_{s2}$ with those calculated
from an alternative estimate $\bar y'_{s2}$ which, as will be presently
shown, provides an unbiased estimate of the yield per unit. The
standard error of the difference between the two estimates is known
to be of the order of 6 to 8%. It will be seen that not only do the
differences change sign from district to district but also their
magnitude is negligible compared to their standard errors.


TABLE 7.4
Mean Yield in the Wheat Survey in Punjab, 1943-44
(Yield in lb. per acre)

District            Simple Arithmetic Mean    Weighted Mean
 1. Amritsar                1029                  1041
 2. Gurdaspur                829                   862
 3. Jullundur                839                   881
 4. Hoshiarpur               804                   796
 5. Ludhiana                1247                  1246
 6. Ferozepur               1052                  1079
 7. Ambala                   854                   820
 8. Karnal                   839                   868
 9. Hissar                  1090                  1142
10. Rohtak                  1004                   997
11. Gurgaon                  766                   752

Province                     920                   927


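(b) Estimate $\bar y'_{s2}$


The estimate corresponds to $\bar y'_n$ of the previous chapter and,
like it, provides an unbiased estimate of the population mean.
For,

$$\begin{aligned}
E(\bar y'_{s2}) &= E\left[\frac{1}{n\bar M}\sum_{i=1}^{n} E\left(M_i\,\bar y_{i(m_i)} \mid i\right)\right] \\
&= E\left[\frac{1}{n\bar M}\sum_{i=1}^{n} M_i\,\bar y_{i.}\right] \\
&= \frac{1}{N\bar M}\sum_{i=1}^{N} M_i\,\bar y_{i.} \\
&= \bar y_{..}
\end{aligned} \tag{86}$$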


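To obtain the sampling variance of $\bar y'_{s2}$, we write

$$\begin{aligned}
V(\bar y'_{s2}) &= E(\bar y'_{s2} - \bar y_{..})^2 \\
&= E(\bar y'_{s2} - \bar y'_{n.} + \bar y'_{n.} - \bar y_{..})^2 \\
&= E(\bar y'_{s2} - \bar y'_{n.})^2 + E(\bar y'_{n.} - \bar y_{..})^2 + 2E(\bar y'_{s2} - \bar y'_{n.})(\bar y'_{n.} - \bar y_{..})
\end{aligned} \tag{87}$$

where $\bar y'_{n.} = \frac{1}{n\bar M}\sum_{i=1}^{n} M_i\,\bar y_{i.}$.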
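Taking the first term in (87), we have

$$\begin{aligned}
E(\bar y'_{s2} - \bar y'_{n.})^2 &= E\left[\frac{1}{n\bar M}\sum_{i=1}^{n} M_i(\bar y_{i(m_i)} - \bar y_{i.})\right]^2 \\
&= \frac{1}{n^2\bar M^2}E\left[\sum_{i=1}^{n} M_i^2\,E\{(\bar y_{i(m_i)} - \bar y_{i.})^2 \mid i\} + \sum_{i \ne i'}^{n} M_i M_{i'}\,E\{(\bar y_{i(m_i)} - \bar y_{i.})(\bar y_{i'(m_{i'})} - \bar y_{i'.}) \mid i, i'\}\right] \\
&= \frac{1}{nN}\sum_{i=1}^{N}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2
\end{aligned} \tag{88}$$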
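Also, we know that

$$E(\bar y'_{n.} - \bar y_{..})^2 = \left(\frac{1}{n} - \frac{1}{N}\right)S_b'^2 \tag{89}$$

where

$$S_b'^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i\,\bar y_{i.}}{\bar M} - \bar y_{..}\right)^2 = \frac{1}{N-1}\sum_{i=1}^{N}(u_i\,\bar y_{i.} - \bar y_{..})^2, \qquad u_i = \frac{M_i}{\bar M} \tag{90}$$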


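The last term in (87) is clearly zero and we are left with

$$V(\bar y'_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b'^2 + \frac{1}{nN}\sum_{i=1}^{N}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{91}$$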


It will be noticed that the first component depends upon the
variation between the cluster totals. It can be shown, in general,
to be larger than the corresponding component in (85), provided
the correlation between the cluster size and the cluster mean is
positive and the bias component $(\bar y_{N.} - \bar y_{..})$ is negligible.
The second component of (91) is also likely to be larger than the
corresponding component of (85) as there is likely to be positive
correlation between $M_i$ and $S_i^2$. Unless, therefore, the bias in
$\bar y_{s2}$ is likely to be serious, the estimate $\bar y'_{s2}$ may not
be preferred to $\bar y_{s2}$.


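(c) Ratio Estimate $\bar y''_{s2}$


We shall assume that the number of first-stage units in the
sample is large enough to neglect the bias term in the expected
value of a ratio estimate. To a first approximation, then,

$$E(\bar y''_{s2}) = \frac{E(\bar y'_{s2})}{E(\bar u_n)} = \bar y_{..} \tag{92}$$

since $E(\bar u_n) = 1$, where $\bar u_n$ is the sample mean of the
$u_i = M_i/\bar M$.

To derive the sampling variance of $\bar y''_{s2}$, we make use of the
result (28) of Chapter IV and write, since $\bar u_N = 1$, to a first
approximation

$$V(\bar y''_{s2}) = V(\bar y'_{s2}) + \bar y_{..}^2\,V(\bar u_n) - 2\bar y_{..}\,\mathrm{Cov}(\bar y'_{s2}, \bar u_n) \tag{93}$$

Now, $V(\bar y'_{s2})$ is known from (91); $V(\bar u_n)$ is given by

$$V(\bar u_n) = \left(\frac{1}{n} - \frac{1}{N}\right)S_u^2 \tag{94}$$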


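where

$$S_u^2 = \frac{1}{N-1}\sum_{i=1}^{N}(u_i - 1)^2$$

and

$$\begin{aligned}
\mathrm{Cov}(\bar y'_{s2}, \bar u_n) &= E(\bar u_n - 1)(\bar y'_{s2} - \bar y_{..}) \\
&= E\left[(\bar u_n - 1)\,E\{(\bar y'_{s2} - \bar y'_{n.}) \mid i\}\right] + E\{(\bar y'_{n.} - \bar y_{..})(\bar u_n - 1)\} \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{N-1}\sum_{i=1}^{N}(u_i\,\bar y_{i.} - \bar y_{..})(u_i - 1)
\end{aligned} \tag{95}$$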


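Collecting together the terms, we get

$$\begin{aligned}
V(\bar y''_{s2}) ={}& \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{N-1}\sum_{i=1}^{N}\left\{(u_i\bar y_{i.} - \bar y_{..})^2 + \bar y_{..}^2(u_i - 1)^2 - 2\bar y_{..}(u_i\bar y_{i.} - \bar y_{..})(u_i - 1)\right\} \\
&+ \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2
\end{aligned}$$

which, on simplification, gives

$$V(\bar y''_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b''^2 + \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{96}$$

where

$$S_b''^2 = \frac{1}{N-1}\sum_{i=1}^{N}u_i^2(\bar y_{i.} - \bar y_{..})^2$$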


It will be noticed that the second term of (96) is identical with
the second term of (91); the first term, on the other hand, is
expected to be less than the corresponding term of (91), if
$M_i\bar y_{i.}$ and $M_i$ are positively correlated and the correlation
is greater than one-half. It will, however, in general, be larger
than the corresponding term in (85), provided $M_i$ and
$(\bar y_{i.} - \bar y_{..})^2$ are positively correlated and the bias
$(\bar y_{N.} - \bar y_{..})$ is negligible. Data on the estimation of
yield of crops provide an example of the relative efficiency of the
three estimates $\bar y_{s2}$, $\bar y'_{s2}$ and $\bar y''_{s2}$. Table 7.5
gives the percentage standard errors of the three estimates, made on
four different yield surveys: wheat survey in U.P. during 1947-48;
wheat survey in Delhi during 1948-49; and cotton surveys in
Madhya Pradesh during 1944-45 and 1945-46. It is seen that
the unweighted mean of plot yields ($\bar y_{s2}$) has the least standard
error, considerably less than those of the other two estimates.
This estimate is, of course, biased; but the bias, as we saw in
Table 7.4, which is typical of these surveys, is found to be
negligible for all practical purposes. In crop surveys in India,
the $M_i$'s are found to vary considerably from village to village,
with the result that the estimate $\bar y'_{s2}$ turns out to be markedly
inefficient, as shown in Table 7.5.


TABLE 7.5
Percentage Standard Errors of Different Estimates of Mean Yield

Survey                               $\bar y_{s2}$   $\bar y'_{s2}$   $\bar y''_{s2}$
Wheat (U.P.), 1947-48                     3.7            14.0             4.7
Wheat (Delhi), 1948-49                    2.5            10.0         [illegible]
Cotton (Madhya Pradesh), 1944-45          5.5            15.0            11.3
Cotton (Madhya Pradesh), 1945-46        [?].9            14.0            13.2
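The comparison of (85), (91) and (96) can be made concrete with a short computation. The following is a minimal sketch, assuming the population quantities (unit sizes, unit means and within-unit mean squares) are known; the function name and the numerical values are hypothetical and are not taken from the surveys quoted above.

```python
from statistics import mean

def variances_of_three_estimates(M, ybar, S2, m, n):
    """Variances (85), (91), (96) and the bias of the simple mean.

    M, ybar, S2 : sizes, means and within mean squares of all N units
    m           : sub-sample size assigned to each unit
    n           : number of first-stage units in the sample
    """
    N = len(M)
    M_bar = mean(M)
    u = [Mi / M_bar for Mi in M]
    y_N = mean(ybar)                                         # mean of unit means
    y_pop = sum(Mi * yi for Mi, yi in zip(M, ybar)) / sum(M)  # population mean
    f = 1 / n - 1 / N
    k = [(1 / mi - 1 / Mi) * s2 for mi, Mi, s2 in zip(m, M, S2)]
    within_plain = sum(k) / (n * N)
    within_weighted = sum(ui ** 2 * ki for ui, ki in zip(u, k)) / (n * N)
    Sb2 = sum((yi - y_N) ** 2 for yi in ybar) / (N - 1)
    Sb2_p = sum((ui * yi - y_pop) ** 2 for ui, yi in zip(u, ybar)) / (N - 1)
    Sb2_pp = sum(ui ** 2 * (yi - y_pop) ** 2 for ui, yi in zip(u, ybar)) / (N - 1)
    return {
        "simple": f * Sb2 + within_plain,         # (85)
        "unbiased": f * Sb2_p + within_weighted,  # (91)
        "ratio": f * Sb2_pp + within_weighted,    # (96), first approximation
        "bias_simple": y_N - y_pop,
    }

# Hypothetical population with widely varying unit sizes.
result = variances_of_three_estimates(
    M=[20, 60, 100, 220], ybar=[9.0, 10.0, 11.0, 10.5],
    S2=[4.0, 4.0, 4.0, 4.0], m=[5, 5, 5, 5], n=2)
for name, value in result.items():
    print(f"{name:12s} {value:8.3f}")
```

With sizes as unequal as these, the between-unit component of the unbiased estimate dominates, which is the behaviour Table 7.5 reports for the crop surveys.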


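(d) Ratio Estimate $\bar y_R$


We shall assume that n is large enough to ignore the bias terms
of the first and higher orders in the expected value of the estimate
$\bar y_R$. To obtain the variance, we write, using (28) of Chapter IV,

$$\frac{V_1(\bar y_R)}{\bar y_{..}^2} = \frac{V(\bar y'_{s2})}{\bar y_{..}^2} + \frac{V(\bar x'_{s2})}{\bar x_{..}^2} - \frac{2\,\mathrm{Cov}(\bar y'_{s2}, \bar x'_{s2})}{\bar y_{..}\,\bar x_{..}} \tag{97}$$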


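Now, by analogy with (91),

$$V(\bar x'_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_{bx}'^2 + \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_{ix}^2 \tag{98}$$

$V(\bar y'_{s2})$ is known from (91) and may be written as

$$V(\bar y'_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_{by}'^2 + \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_{iy}^2 \tag{99}$$

and

$$\begin{aligned}
\mathrm{Cov}(\bar y'_{s2}, \bar x'_{s2}) &= E(\bar y'_{s2} - \bar y_{..})(\bar x'_{s2} - \bar x_{..}) \\
&= E\{(\bar y'_{s2} - \bar y'_{n.} + \bar y'_{n.} - \bar y_{..})(\bar x'_{s2} - \bar x'_{n.} + \bar x'_{n.} - \bar x_{..})\} \\
&= E(\bar y'_{s2} - \bar y'_{n.})(\bar x'_{s2} - \bar x'_{n.}) + E(\bar y'_{n.} - \bar y_{..})(\bar x'_{n.} - \bar x_{..})
\end{aligned} \tag{100}$$

since expectations of the other two product terms are zero.
Taking the first term in (100), we have

$$\begin{aligned}
E(\bar y'_{s2} - \bar y'_{n.})(\bar x'_{s2} - \bar x'_{n.}) &= \frac{1}{n^2}E\left[\sum_{i=1}^{n}u_i^2\,E\{(\bar y_{i(m_i)} - \bar y_{i.})(\bar x_{i(m_i)} - \bar x_{i.}) \mid i\} + \sum_{i \ne i'}^{n}u_i u_{i'}\,E\{(\bar y_{i(m_i)} - \bar y_{i.})(\bar x_{i'(m_{i'})} - \bar x_{i'.}) \mid i, i'\}\right] \\
&= \frac{1}{n^2}E\left[\sum_{i=1}^{n}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_{ixy}\right] \\
&= \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_{ixy}
\end{aligned} \tag{101}$$

where

$$S_{ixy} = \frac{1}{M_i - 1}\sum_{j=1}^{M_i}(y_{ij} - \bar y_{i.})(x_{ij} - \bar x_{i.}) \tag{102}$$

The second term in (100) gives

$$E(\bar y'_{n.} - \bar y_{..})(\bar x'_{n.} - \bar x_{..}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_{bxy}' \tag{103}$$

where

$$S_{bxy}' = \frac{1}{N-1}\sum_{i=1}^{N}(u_i\bar y_{i.} - \bar y_{..})(u_i\bar x_{i.} - \bar x_{..}) \tag{104}$$

so that from (100), (101) and (103),

$$\mathrm{Cov}(\bar y'_{s2}, \bar x'_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_{bxy}' + \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_{ixy} \tag{105}$$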
Substituting in (97) from (98), (99) and (105), we get

$$V_1(\bar y_R) = \left(\frac{1}{n} - \frac{1}{N}\right)\left(S_{by}'^2 + R^2 S_{bx}'^2 - 2R\,S_{bxy}'\right) + \frac{1}{nN}\sum_{i=1}^{N}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)D_i^2 \tag{106}$$

where $R = \bar y_{..}/\bar x_{..}$ and

$$D_i^2 = S_{iy}^2 + R^2 S_{ix}^2 - 2R\,S_{ixy} \tag{107}$$

When $x_{ij} = 1$, (107) becomes

$$D_i^2 = S_i^2$$

and the variance (106) reduces to the expression (96) as expected.


In general, when the $M_i$'s vary rather considerably, the estimate
$\bar y_R$ is likely to be the most efficient of the four estimates,
provided n is large and x is highly correlated with y.


7.13 Two-Stage Sampling, Unequal First-Stage Units: Estimation 
of the Variances from the Sample 


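(a) Mean of Cluster Means Estimate $\bar y_{s2}$

Consider the mean square between cluster means in the sample,
$s_b^2$, as defined by

$$s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar y_{i(m_i)} - \bar y_{s2})^2 = \frac{1}{n-1}\left\{\sum_{i=1}^{n}\bar y_{i(m_i)}^2 - n\,\bar y_{s2}^2\right\} \tag{108}$$

Multiplying both sides of (108) by (n - 1) and taking expectations,
we obtain

$$\begin{aligned}
(n-1)E(s_b^2) &= E\left[\sum_{i=1}^{n}E(\bar y_{i(m_i)}^2 \mid i)\right] - nE(\bar y_{s2}^2) \\
&= E\left[\sum_{i=1}^{n}\left\{\bar y_{i.}^2 + \left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2\right\}\right] - nE(\bar y_{s2}^2) \\
&= \frac{n}{N}\sum_{i=1}^{N}\left\{\bar y_{i.}^2 + \left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2\right\} - n\{\bar y_{N.}^2 + V(\bar y_{s2})\}
\end{aligned}$$

whence, substituting from (85) and remembering that

$$S_b^2 = \frac{1}{N-1}\left\{\sum_{i=1}^{N}\bar y_{i.}^2 - N\bar y_{N.}^2\right\}$$

we obtain

$$E(s_b^2) = S_b^2 + \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{109}$$

Also

$$E\left[\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2\right] = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{110}$$

where

$$s_i^2 = \frac{1}{m_i - 1}\sum_{j=1}^{m_i}(y_{ij} - \bar y_{i(m_i)})^2 \tag{111}$$

Hence

$$\text{Est. } S_b^2 = s_b^2 - \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2 \tag{112}$$

Substituting from (110) and (112) in (85), we obtain

$$\text{Est. } V(\bar y_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 + \frac{1}{nN}\sum_{i=1}^{n}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2 \tag{113}$$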
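The estimates (112) and (113) translate directly into a few lines of code. This is a sketch under stated assumptions: the function name, argument layout and the sample values are illustrative, and each selected unit is assumed to contribute at least two observations so that the within mean square can be computed.

```python
from statistics import mean, variance

def estimate_variance_of_mean_of_means(sub_samples, M_selected, N):
    """Return (ybar_s2, Est. S_b^2 from (112), Est. V(ybar_s2) from (113)).

    sub_samples : list of lists, observations drawn from each selected unit
    M_selected  : sizes M_i of the selected first-stage units, same order
    N           : number of first-stage units in the population
    """
    n = len(sub_samples)
    unit_means = [mean(s) for s in sub_samples]
    y_s2 = mean(unit_means)
    s_b2 = variance(unit_means)                    # (108), divisor n - 1
    within = sum((1 / len(s) - 1 / Mi) * variance(s)
                 for s, Mi in zip(sub_samples, M_selected))
    est_Sb2 = s_b2 - within / n                    # (112)
    est_V = (1 / n - 1 / N) * s_b2 + within / (n * N)   # (113)
    return y_s2, est_Sb2, est_V

# Hypothetical sample: three selected units out of N = 30.
sample = [[12.0, 15.0, 11.0], [20.0, 18.0], [9.0, 14.0, 10.0, 13.0]]
y_s2, est_Sb2, est_V = estimate_variance_of_mean_of_means(sample, [40, 25, 60], N=30)
print(f"mean of cluster means = {y_s2:.3f}")
print(f"Est. S_b^2 = {est_Sb2:.3f}, Est. V = {est_V:.3f}")
```

Note that (112) can yield a negative value in small samples, since it is a difference of two mean squares; the sketch returns it as computed.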


It should be pointed out that $s_b^2$ as defined in (108) cannot be
derived from the mean square between clusters in the table
showing the analysis of variance on $y_{ij}$ unless the $m_i$'s are all
equal. In most surveys, however, as will be shown in the next section,
$m_i$ will have a constant value within a stratum, although varying
from stratum to stratum. Consequently, if analysis of variance
is carried out separately for each stratum, we can estimate the
variance of the mean for that stratum by substituting from the
analysis of variance table. Thus, if B is the mean square between
clusters and W the mean square within clusters in the sample
from any stratum, and further, the within-variance $S_i^2$ is assumed
to be constant for all i, we obtain

$$\text{Est. } V(\bar y_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{B}{m} + \frac{1}{N}\left(\frac{1}{m} - \frac{1}{M_h}\right)W$$

where $M_h$ is the harmonic mean of the $M_i$'s in the sample.
When $M_i = M$ (i = 1, 2, ..., N), and $m_i = m$ (i = 1, 2, ..., N),
the formulae become identical with those in Section 7.4.


For N large, the variance can be computed from the simple
expression

$$\text{Est. } V(\bar y_{s2}) = \frac{s_b^2}{n} \tag{114}$$


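(b) Mean of Cluster Totals Estimate $\bar y'_{s2}$


Consider the mean square per element basis of the cluster
totals in the sample defined by

$$s_b'^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{M_i\,\bar y_{i(m_i)}}{\bar M} - \bar y'_{s2}\right)^2 \tag{115}$$

On expanding and taking expectations, we get

$$\begin{aligned}
(n-1)E(s_b'^2) &= E\left[\sum_{i=1}^{n}E\left\{\frac{M_i^2}{\bar M^2}\,\bar y_{i(m_i)}^2 \,\Big|\, i\right\}\right] - nE(\bar y_{s2}'^2) \\
&= \frac{n}{N}\sum_{i=1}^{N}\frac{M_i^2}{\bar M^2}\left\{\bar y_{i.}^2 + \left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2\right\} - n\{\bar y_{..}^2 + V(\bar y'_{s2})\}
\end{aligned}$$

whence

$$E(s_b'^2) = S_b'^2 + \frac{1}{N}\sum_{i=1}^{N}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{116}$$

Also

$$E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2\right] = \frac{1}{N}\sum_{i=1}^{N}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)S_i^2 \tag{117}$$

Hence

$$\text{Est. } S_b'^2 = s_b'^2 - \frac{1}{n}\sum_{i=1}^{n}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2 \tag{118}$$

On substituting in (91) from (117) and (118), we get

$$\text{Est. } V(\bar y'_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b'^2 + \frac{1}{nN}\sum_{i=1}^{n}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2 \tag{119}$$

When the $m_i$'s are equal, the variance may be calculated directly
from the analysis of variance on $M_i y_{ij}/\bar M$, as explained in the
previous section, provided $S_i$ is assumed to be constant for all i.

For N large, we have the simple expression

$$\text{Est. } V(\bar y'_{s2}) = \frac{s_b'^2}{n} \tag{120}$$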
(c) Ratio Estimate $\bar y''_{s2}$

Let

$$s_b''^2 = \frac{1}{n-1}\sum_{i=1}^{n}\frac{M_i^2}{\bar M^2}\left(\bar y_{i(m_i)} - \bar y''_{s2}\right)^2 \tag{121}$$

On substituting for y," and expanding, we have 


M? (n — 1) 52 = Із MS —2 (2 um 


x 


i 


Зи 2 п 
2. Мау, dr sonde 7.4 MPP у 
р 


М, 


|| 


+ 2 МРМ Fin Уға) 


Sme)» қ 
+- 252 | zo ож 
М, и шы yc MM equae 


Ш x n 2 
s Шы) + (2 ме) È M Dimo 
УМ, УМ, 


i. 


0) 


1) 


| 
| 
Taking expectations for a fixed sample of first-stage units term
by term, we obtain 


Ro) E607) weis + (= arse] 


[нере Go 


+ S MEM је сна Мг 45,2 
iet = | (Ем) р ү 


"xx £o ed] 
Guy N 
E С ме 
Combining the terms in the $\bar y$'s and those in the $S_i^2$'s separately and


putting 
5 Е ы УМ 
Ў м, 
we һауе 


i UOTIS 
Eni [Die 7 pn у 9s» 


Taking further expectation over first-stage samples of n and 
using the result (43) from Chapter IV, namely, that in large 


samples, 


1 J МЕ adta В 
2 VE (pi 5 = 8, 
E = Mr Qi Vn.) | v 
we obtain 


TUER м? 
Еу) = S la 47 in 


1 сім, bur 
Хм, (м) 
whence 
Ug са во 1 " 
991220 


" | _ 2M | iur] 
“М, (£ м) ) 


| 


(122) 


(123) 


(124) 


-JEF > ы 
On substituting from (117) and (124) in (96), we get


Est. V (y,") -G = g)at4 c 19° : = 27 гё} 
pue 


2 "M? 
x ( x 28 T A, | 
йм, (m) 
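or, to a first approximation,

$$\text{Est. } V(\bar y''_{s2}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b''^2 + \frac{1}{nN}\sum_{i=1}^{n}\frac{M_i^2}{\bar M^2}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2 \tag{125}$$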


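(d) Ratio Estimate $\bar y_R$


The steps leading to the estimation from the sample of the
variance of $\bar y_R$ are similar to those given in (c) above. We
shall quote here only the final result. We write, to a first
approximation,

$$\text{Est. } V_1(\bar y_R) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{n-1}\sum_{i=1}^{n}u_i^2\left(\bar y_{i(m_i)} - R_n\,\bar x_{i(m_i)}\right)^2 + \frac{1}{nN}\sum_{i=1}^{n}u_i^2\left(\frac{1}{m_i} - \frac{1}{M_i}\right)d_i^2 \tag{126}$$

where

$$d_i^2 = s_{iy}^2 + R_n^2\,s_{ix}^2 - 2R_n\,s_{ixy} \tag{127}$$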


7.14 The Use of the Analysis of Variance Method to Compute
the Variance of the Sample Mean


It has been pointed out already that unless the $m_i$'s are equal it
is not valid to evaluate the variance of the estimate using the
analysis of variance table. Nevertheless, for moderate inequality
in the $m_i$'s, the method with certain adjustments has been
recommended for use (Yates, 1949). The method consists in calculating
a number $\lambda$ given by

$$\lambda = \frac{1}{n-1}\left(m_0 - \frac{\sum_{i=1}^{n} m_i^2}{m_0}\right) \tag{128}$$

and using it to estimate $S_b^2$ by means of

$$\text{Est. } S_b^2 = \frac{B - W}{\lambda} \tag{129}$$

where B and W denote the mean squares in the analysis of
variance table for the sample. On account of its simplicity
the method makes an appeal to the practical worker. It is
therefore important to examine the conditions when it can be
used.


Let us suppose that

(a) N, the number of first-stage units in the population, is
large;

(b) $M_i$, the number of second-stage units in each first-stage unit,
is equal to M;

(c) $m_i$ is the number of second-stage units to be drawn from
the first-stage unit selected at the i-th draw (i = 1, 2, ..., n);
and

(d) $S_i^2$ is constant for all i and equal to, say, $S_w^2$.

Consider the estimate

$$\bar y_{m_0} = \frac{1}{m_0}\sum_{i=1}^{n}\sum_{j=1}^{m_i} y_{ij} \tag{130}$$

where $m_0$ has the usual meaning $\sum_{i=1}^{n} m_i$. It is easy to see that
this is an unbiased estimate of the population mean $\bar y_{..}$. For,

$$\begin{aligned}
E(\bar y_{m_0}) &= \frac{1}{m_0}\sum_{i=1}^{n} m_i\,E(\bar y_{i(m_i)}) \\
&= \frac{1}{m_0}\sum_{i=1}^{n} m_i\,E(\bar y_{i.}) \\
&= \bar y_{..}
\end{aligned} \tag{131}$$


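The variance of $\bar y_{m_0}$ is given by

$$\begin{aligned}
V(\bar y_{m_0}) &= E(\bar y_{m_0} - \bar y_{..})^2 = \frac{1}{m_0^2}E\left\{\sum_{i=1}^{n} m_i(\bar y_{i(m_i)} - \bar y_{..})\right\}^2 \\
&= \frac{1}{m_0^2}E\left[\sum_{i=1}^{n} m_i^2(\bar y_{i(m_i)} - \bar y_{..})^2 + \sum_{i \ne i'}^{n} m_i m_{i'}(\bar y_{i(m_i)} - \bar y_{..})(\bar y_{i'(m_{i'})} - \bar y_{..})\right] \\
&= \frac{1}{m_0^2}\left[\sum_{i=1}^{n} m_i^2\left\{E(\bar y_{i(m_i)} - \bar y_{i.})^2 + E(\bar y_{i.} - \bar y_{..})^2\right\} + \sum_{i \ne i'}^{n} m_i m_{i'}\,E(\bar y_{i.} - \bar y_{..})(\bar y_{i'.} - \bar y_{..})\right]
\end{aligned}$$

on writing $\bar y_{i(m_i)} - \bar y_{..} = (\bar y_{i(m_i)} - \bar y_{i.}) + (\bar y_{i.} - \bar y_{..})$,
since the expectations of all the other terms are clearly zero.
Hence

$$\begin{aligned}
V(\bar y_{m_0}) &= \frac{1}{m_0^2}\sum_{i=1}^{n} m_i^2\left\{\left(\frac{1}{m_i} - \frac{1}{M}\right)S_w^2 + S_b^2\right\} \\
&= \left(\frac{1}{m_0} - \frac{\sum m_i^2}{M m_0^2}\right)S_w^2 + \frac{\sum m_i^2}{m_0^2}\,S_b^2
\end{aligned} \tag{132}$$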


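neglecting terms containing 1/N. To evaluate $S_b^2$, we start with
B, the mean square in the analysis of variance for the sample,
and take expectations. We obtain

$$\begin{aligned}
E(B) &= \frac{1}{n-1}E\left[\sum_{i=1}^{n} m_i(\bar y_{i(m_i)} - \bar y_{m_0})^2\right] \\
&= \frac{1}{n-1}E\left[\sum_{i=1}^{n} m_i\,\bar y_{i(m_i)}^2 - m_0\,\bar y_{m_0}^2\right] \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} m_i\,E(\bar y_{i(m_i)}^2) - m_0\{\bar y_{..}^2 + V(\bar y_{m_0})\}\right] \\
&= \frac{1}{n-1}\left[\left(m_0 - \frac{\sum m_i^2}{m_0}\right)S_b^2 + \left\{(n-1) - \frac{1}{M}\left(m_0 - \frac{\sum m_i^2}{m_0}\right)\right\}S_w^2\right] \\
&= \lambda S_b^2 + S_w^2\left(1 - \frac{\lambda}{M}\right)
\end{aligned} \tag{133}$$

whence

$$\text{Est. } S_b^2 = \frac{B - W\left(1 - \dfrac{\lambda}{M}\right)}{\lambda} \tag{134}$$

For M large, we have

$$\text{Est. } S_b^2 = \frac{B - W}{\lambda} \tag{135}$$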


We may infer that if the sampling fraction at each stage is
small and the variation in the size of first-stage units is negligible,
the method may give reliable results. Its use under conditions
other than those specified above will need to be justified. It should
also be mentioned that the system of drawing $m_i$ second-stage units
at the i-th draw irrespective of which first-stage unit is included
in the sample at the i-th draw is not a rational system which is
likely to be used in practice. In an efficient survey design, the
$m_i$'s will usually be equal within a stratum, although it is likely
that through extraneous causes the numbers of second-stage units
actually collected in the sample may be unequal. If these
extraneous causes are random causes, in other words, if
$m_1, m_2, \ldots, m_n$ can be considered a random sample from the
respective first-stage units and further the variation among the
$M_i$ is not large, this method, as in (134), may give a sufficiently
good estimate of the variance when N is large.
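The adjustment described in this section is easy to mechanise. The sketch below assumes the form of $\lambda$ that makes the expectation (133) hold, namely $\lambda = (m_0 - \sum m_i^2/m_0)/(n-1)$, and computes B, W and the estimate (134), falling back to (135) when the common unit size M is treated as large. The function name and the data are hypothetical.

```python
from statistics import mean

def between_component(groups, M=None):
    """Return (lam, B, W, est_Sb2) for sub-samples of unequal size.

    groups : list of lists, observations from each selected first-stage unit
    M      : common size of the first-stage units, or None to treat M as large
    """
    n = len(groups)
    sizes = [len(g) for g in groups]
    m0 = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / m0
    # Mean square between units and pooled mean square within units.
    B = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (n - 1)
    W = sum((y - mean(g)) ** 2 for g in groups for y in g) / (m0 - n)
    lam = (m0 - sum(s * s for s in sizes) / m0) / (n - 1)
    shrink = 1.0 if M is None else 1.0 - lam / M
    return lam, B, W, (B - W * shrink) / lam

# Hypothetical data with moderately unequal sub-sample sizes.
data = [[10.0, 12.0, 11.0], [15.0, 14.0], [9.0, 8.0, 10.0, 9.0]]
lam, B, W, est = between_component(data, M=50)
print(f"lambda = {lam:.3f}, B = {B:.3f}, W = {W:.3f}, Est. S_b^2 = {est:.3f}")
```

When all sub-samples have the same size m, $\lambda$ reduces to m and the estimate coincides with the usual one for equal sub-samples.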


7.15 Allocation of Sample 


In our discussion so far, we have assumed that the number of
second-stage units to be drawn from the i-th first-stage unit,
namely $m_i$, is any arbitrary number less than $M_i$. It may be
related in some way to the size of the i-th unit, as for example,
when it is proportional to $M_i$, or it may be independent of it.
The guiding principle in choosing it is clearly to maximize the
precision of the estimate for given cost or minimize the cost for
desired precision.


We shall suppose that the total cost consists of two components,
one depending upon the number of first-stage units in the sample,
$c_1 n$, and the other on the total number of second-stage units in the
sample, namely $c_2\sum_{i=1}^{n} m_i$. This second component will, however,
vary from sample to sample of n first-stage units. We shall
therefore consider the average cost instead of the actual cost of
surveying a sample, given by

$$C = c_1 n + c_2\,\frac{n}{N}\sum_{i=1}^{N} m_i \tag{136}$$

and proceed to determine the optimum allocation. We shall
suppose that the estimate to be used is the unbiased
estimate $\bar y'_{s2}$.


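Now, from (91) and (136), we obtain

$$\left(V + \frac{S_b'^2}{N}\right)C = \left\{\Delta + \frac{1}{N}\sum_{i=1}^{N}\frac{M_i^2 S_i^2}{\bar M^2 m_i}\right\}\left(c_1 + \frac{c_2}{N}\sum_{i=1}^{N} m_i\right) \tag{137}$$

where

$$\Delta = S_b'^2 - \frac{1}{N}\sum_{i=1}^{N}\frac{M_i S_i^2}{\bar M^2}$$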
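Assuming $\Delta$ to be positive, the right-hand side of (138) can be
put in the form

$$\begin{aligned}
& c_1\Delta + \frac{c_2\Delta}{N}\sum_{i=1}^{N} m_i + \frac{c_1}{N\bar M^2}\sum_{i=1}^{N}\frac{M_i^2 S_i^2}{m_i} + \frac{c_2}{N^2\bar M^2}\left(\sum_{i=1}^{N}\frac{M_i^2 S_i^2}{m_i}\right)\left(\sum_{i=1}^{N} m_i\right) \\
&\quad= c_1\Delta + \frac{1}{N}\sum_{i=1}^{N}\left(\sqrt{c_2\Delta\,m_i} - \sqrt{\frac{c_1}{m_i}}\,\frac{M_i S_i}{\bar M}\right)^2 + \frac{2\sqrt{c_1 c_2\Delta}}{N\bar M}\sum_{i=1}^{N} M_i S_i \\
&\qquad+ \frac{c_2}{N^2\bar M^2}\left(\sum_{i=1}^{N} M_i S_i\right)^2 + \frac{c_2}{N^2\bar M^2}\sum_{i > i'}\left(\sqrt{\frac{m_{i'}}{m_i}}\,M_i S_i - \sqrt{\frac{m_i}{m_{i'}}}\,M_{i'} S_{i'}\right)^2
\end{aligned} \tag{140}$$

and this is minimum when each of the two square terms is zero,
giving us

$$m_i = \sqrt{\frac{c_1}{c_2\Delta}}\;\frac{M_i S_i}{\bar M} \qquad (i = 1, 2, \ldots, N) \tag{141}$$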


We notice that $m_i$ is proportional to the product of three factors:
the first depending upon the cost factors, the second upon the
size of the selected unit and the third on the variation of the
character under study within the selected unit. Advance knowledge
of $S_i^2$ is, however, difficult to obtain. Practical considerations
require that $m_i$ should be independent of $S_i^2$, even if this
means departing somewhat from the optimum. One method of
choosing $m_i$ is to make it proportional to $M_i$. This would
imply the assumption that $S_i^2$ is constant for all i. Usually $S_i^2$
will be found to increase with $M_i$, although perhaps not as fast
as $M_i$. One method of reducing the dependence of $m_i$ on $S_i^2$
is to group together into strata first-stage units of about the same
size, provided stratification by size does not prevent stratification
by other and more important characters, and choose $m_i$ proportional
to $M_i$ within the several strata.
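The optimum in (141) can be illustrated numerically. The sketch below is written under the assumption that $\Delta = S_b'^2 - \frac{1}{N}\sum M_i S_i^2/\bar M^2$, the quantity that appears when (91) and (136) are multiplied together; the function names, the cost figures and the population values are all hypothetical.

```python
from math import sqrt

def delta_term(M, S, Sb2_prime):
    """Delta = S_b'^2 - (1/N) * sum(M_i S_i^2) / Mbar^2 (assumed definition)."""
    N = len(M)
    M_bar = sum(M) / N
    return Sb2_prime - sum(Mi * Si ** 2 for Mi, Si in zip(M, S)) / (N * M_bar ** 2)

def optimum_subsample_sizes(M, S, c1, c2, Sb2_prime):
    """Sub-sample sizes from (141); fractional values are returned unrounded."""
    M_bar = sum(M) / len(M)
    delta = delta_term(M, S, Sb2_prime)
    if delta <= 0:
        raise ValueError("Delta must be positive for (141) to apply")
    factor = sqrt(c1 / (c2 * delta))
    return [factor * Mi * Si / M_bar for Mi, Si in zip(M, S)]

def variance_cost_product(m, M, S, c1, c2, Sb2_prime):
    """The product (V + S_b'^2 / N) * C that the allocation minimizes."""
    N = len(M)
    M_bar = sum(M) / N
    delta = delta_term(M, S, Sb2_prime)
    within = sum((Mi * Si / M_bar) ** 2 / mi for Mi, Si, mi in zip(M, S, m)) / N
    return (delta + within) * (c1 + c2 * sum(m) / N)

# Hypothetical units: sizes, within standard deviations and unit costs.
M, S = [40, 60, 100], [2.0, 3.0, 2.5]
best = optimum_subsample_sizes(M, S, c1=16.0, c2=1.0, Sb2_prime=1.5)
print("optimum m_i:", [round(x, 2) for x in best])
print("product at optimum:", round(variance_cost_product(best, M, S, 16.0, 1.0, 1.5), 4))
```

In practice the values would be rounded to integers and capped at $M_i$; the sketch leaves them unrounded so that the minimizing property can be checked directly.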
Suppose then $m_i$ is so determined as to be proportional to
$M_i$, say,

$$m_i = kM_i \tag{142}$$

where k is some positive constant. Substituting from (142) in
(137) for $m_i$, we get


іші 


N 
(ro) ja 52, G= \4 m р А i уз ust} (о-о) 


N 
Uy, MS; 


С N ў 5 
%2,/ NM 2 MS; 
қ ICA m, mo 
rla NEN k EMSi— vasi) 
which is minimum for (143) 
(144) 
(145) 


where $\rho$ may be termed the average intra-class correlation over
all units in the stratum. It follows that k is determined by the
same considerations as those discussed earlier for the case of
equal clusters. Knowing k, the value of n is obtained from (91)
or (136) according as V or C is fixed in advance and C or V
is minimized.

The reader may verify that the optimum allocation is governed
by the same formulae as those presented here even when any of
the other consistent estimates is used. Thus, for the estimate
$\bar y''_{s2}$, he will notice that the sampling variance has the same form
as that for the estimate $\bar y'_{s2}$ except that $S_b'^2$ is replaced by
$S_b''^2$. It follows that the optimum value of $m_i$ is so determined that

$$m_i \propto M_i S_i$$

For $S_i$ constant, we obtain

$$m_i = k' M_i$$

where k' is a positive constant.


Example 7.2 


A corn borer survey is carried out every fall in Iowa (U.S.A.)
for estimating the number of borers per plant. Fifty sampling
units of 25 stalks are selected at random from each district, and
for each sampling unit the number of corn borers per stalk is
estimated. Since it is costly to dissect all of the infested plants
in each sampling unit, a sub-sample of two is dissected to obtain
the estimate of the number of corn borers per infested plant.
When the number of infested plants in a sampling unit is one,
that is dissected.


Two methods of estimation are followed:


(a) The first method consists in computing the simple mean of
the number of borers per plant from each sampling unit.
Thus, if $M_i$ is the number of infested plants in the i-th
sampling unit and $\bar y_{i(m_i)}$ the estimated number of borers
per infested plant in the i-th sampling unit and n the
number of units in the sample, then an estimate of $\mu$, the
number of borers per 25 plants, is given by

$$z_1 = \frac{1}{n}\sum_{i=1}^{n} M_i\,\bar y_{i(m_i)}$$

(b) The other method of estimation is to compute $z_2$, where

$$z_2 = \frac{\sum_{i=1}^{n} M_i}{n}\cdot\frac{\sum_{i=1}^{n}\bar y_{i(m_i)}}{n} = \bar M_n\,\bar y_{s2}$$

i.e., the product of the average infestation per sampling
unit and the average number of borers per infested plant
per sampling unit.


Columns 2, 3 and 4 of Table 7.6 contain the relevant data for
one district. Obtain the two estimates. Examine whether
they are unbiased estimates of the population value $\mu$ for the
district. Also calculate the mean square errors of the two
estimates.

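The estimate $z_1$ corresponds to the estimate $\bar y'_{s2}$ in the text.
It is therefore an unbiased estimate of the population value. In
Cols. 2, 5 and 6 of Table 7.6 are given the values of $M_i$,
$\bar y_{i(m_i)}$ and of the products $M_i\bar y_{i(m_i)}$, called $z_i$, and the
means of all the three for the 50 units in the sample. We find that

$$z_1 = \frac{1}{n}\sum_{i=1}^{n} z_i = 12.4$$

The variance of $z_1$ can be directly calculated from (120) after
putting $\bar M = 1$ in that expression. We obtain

$$\text{Est. } V(z_1) = \frac{s_b'^2}{n} = \frac{1}{n(n-1)}\left\{\sum_{i=1}^{n} z_i^2 - n\bar z^2\right\} = \frac{19463 - 7750}{50 \times 49} = 4.78$$

Turning to the second estimate, on substituting from Table 7.6
the values of $\bar M_n$ and $\bar y_{s2}$ in the expression for $z_2$,
we obtain

$$z_2 = 13.34 \times 0.80 = 10.7$$

To obtain the expected value of $z_2$ we have

$$\begin{aligned}
E(z_2) &= \frac{1}{n^2}E\left\{\left(\sum_{i=1}^{n} M_i\right)\left(\sum_{i=1}^{n}\bar y_{i(m_i)}\right)\right\} \\
&= \frac{1}{n^2}E\left\{\left(\sum_{i=1}^{n} M_i\right)\left(\sum_{i=1}^{n}\bar y_{i.}\right)\right\} \\
&= \frac{1}{n^2}E\left(\sum_{i=1}^{n} M_i\bar y_{i.} + \sum_{i \ne i'}^{n} M_i\bar y_{i'.}\right) \\
&= \frac{1}{n}\left\{\mu + (n-1)\bar M\,\bar y_{N.}\right\}
\end{aligned}$$

for large N.

Define

$$\rho = \frac{E(M_i - \bar M)(\bar y_{i.} - \bar y_{N.})}{\sqrt{E(M_i - \bar M)^2\,E(\bar y_{i.} - \bar y_{N.})^2}}$$

Then we can put

$$\mu = E(M_i\,\bar y_{i.}) = \bar M\,\bar y_{N.} + \rho\,S_M S_b$$

and

$$E(z_2) = \frac{1}{n}\left\{\mu + (n-1)(\mu - \rho\,S_M S_b)\right\}$$

or

$$E(z_2) = \mu - \frac{n-1}{n}\,\rho\,S_M S_b$$

since N is large. We note that $z_2$ is a biased estimate of $\mu$.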


To evaluate the bias we need the estimates of $\rho$, $S_M$ and $S_b$.
These can be taken directly from Table 7.6. We have

Est. $\rho$ = r = 0.261
Est. $S_M^2$ = $s_M^2$ = 65.58
Est. $S_M$ = $s_M$ = 8.10

while $S_b^2$ is obtained from (112), being given by

$$\text{Est. } S_b^2 = s_b^2 - \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{m_i} - \frac{1}{M_i}\right)s_i^2 = 0.7347 - \frac{1}{50}(14.42) = 0.4463$$

so that Est. $S_b$ = 0.668. Hence the bias is estimated, in absolute
value, as

$$\frac{n-1}{n}\,r\,s_M\,(\text{Est. } S_b) = \frac{49}{50}(0.261)(8.10)(0.668) = 1.38$$

The derivation of the variance of $z_2$ is rather complex, but for
large N and n, and n/N negligible, it presents no difficulty. We
write to a first approximation,

$$V(\bar M_n\,\bar y_{s2}) = \bar y_{N.}^2\,V(\bar M_n) + \bar M^2\,V(\bar y_{s2}) + 2\,\bar y_{N.}\,\bar M\,\mathrm{Cov}(\bar M_n, \bar y_{s2})$$

whence to terms in 1/n, we get

$$\text{Est. } V(\bar M_n\,\bar y_{s2}) = \frac{\bar y_{s2}^2\,s_M^2}{n} + \frac{\bar M_n^2\,s_b^2}{n} + \frac{2\,\bar y_{s2}\,\bar M_n\,r\,s_M\,s_b}{n}$$

On substituting from Table 7.6, we get

$$\begin{aligned}
\text{Est. } V(z_2) &= \frac{0.64\,(65.58)}{50} + \frac{(13.34)^2\,(0.7347)}{50} + \frac{2\,(0.8)\,(13.34)\,(0.261)\,(8.10)\,(0.857)}{50} \\
&= 0.839 + 2.616 + 0.773 \\
&= 4.23
\end{aligned}$$

and

$$\text{M.S.E.}(z_2) = V(z_2) + \text{bias}^2 = 4.23 + 1.90 = 6.13$$
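The arithmetic of Example 7.2 can be retraced from the summary figures quoted in the text. The sketch below uses only those quoted figures (Table 7.6 itself is not reproduced here); the variable names are illustrative. Small differences in the last digit arise because the text rounds intermediate results.

```python
from math import sqrt

n = 50                                  # sampling units in the district

# Estimate z1: sum of z_i^2 and n * zbar^2 as quoted in the text.
sum_z_sq, n_zbar_sq = 19463.0, 7750.0
var_z1 = (sum_z_sq - n_zbar_sq) / (n * (n - 1))

# Estimate z2: product of the two sample means quoted in the text.
M_bar_n, y_bar_s2 = 13.34, 0.80
z2 = M_bar_n * y_bar_s2

# Quantities needed for the bias and variance of z2.
r, s_M_sq, s_b_sq = 0.261, 65.58, 0.7347
within_sum = 14.42                      # sum of (1/m_i - 1/M_i) * s_i^2
est_Sb_sq = s_b_sq - within_sum / n     # from (112)
bias = (n - 1) / n * r * sqrt(s_M_sq) * sqrt(est_Sb_sq)

var_z2 = (y_bar_s2 ** 2 * s_M_sq
          + M_bar_n ** 2 * s_b_sq
          + 2 * y_bar_s2 * M_bar_n * r * sqrt(s_M_sq) * sqrt(s_b_sq)) / n
mse_z2 = var_z2 + bias ** 2

print(f"Est. V(z1) = {var_z1:.2f}")
print(f"z2 = {z2:.1f}, |bias| = {bias:.2f}, Est. V(z2) = {var_z2:.2f}")
print(f"M.S.E.(z2) = {mse_z2:.2f}")
```

On these figures the biased estimate $z_2$ has a somewhat smaller variance than $z_1$ but a larger mean square error once the squared bias is added.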


7.16 Stratified Sub-Sampling


By far the most common design in surveys is stratified multi-stage
sampling. In this design the population of first-stage units
is first divided into strata, within each stratum a sample of
first-stage units is selected and each of the selected first-stage units
is further sub-sampled. Crop surveys with the subdivision as
the stratum, described in Example 7.1, and the corn borer survey
with the district as the stratum, described in Example 7.2, are
examples of this design. In this section we shall give the formulae
for the estimate of the population mean in stratified two-stage
sampling, and its variance. We shall consider the unbiased
estimate only.


Let the population be divided into k strata with $N_t$ first-stage
units in the t-th stratum, so that

$$\sum_{t=1}^{k} N_t = N$$

Further, we shall denote by $M_{ti}$ the number of second-stage units
in the i-th first-stage unit of the t-th stratum, with $M_{t0}$ denoting
the total number of second-stage units in the t-th stratum, i.e.,

$$M_{t0} = \sum_{i=1}^{N_t} M_{ti} = N_t\bar M_t$$

Let $n_t$ denote the number of first-stage units to be included in the
sample from the t-th stratum, so that

$$\sum_{t=1}^{k} n_t = n$$

and $m_{ti}$ denote the number of second-stage units to be sampled
from the i-th selected first-stage unit.

Following the previous notation, we shall denote by

$\bar y_{t..}$ = the population mean per second-stage unit in the t-th
stratum

$$= \frac{1}{M_{t0}}\sum_{i=1}^{N_t}\sum_{j=1}^{M_{ti}} y_{tij}$$

$\bar y_{ts}$ = the corresponding sample mean for the t-th stratum

$$= \frac{1}{\bar M_t\,n_t}\sum_{i=1}^{n_t} M_{ti}\,\bar y_{ti(m_{ti})}$$

$\bar y_{...}$ = the population mean per second-stage unit

$$= \frac{\sum_{t=1}^{k} M_{t0}\,\bar y_{t..}}{\sum_{t=1}^{k} M_{t0}} = \sum_{t=1}^{k}\lambda_t\,\bar y_{t..}, \qquad \lambda_t = \frac{M_{t0}}{\sum_{t=1}^{k} M_{t0}}$$

and

$\bar y_w$ = the corresponding sample estimate

$$= \sum_{t=1}^{k}\lambda_t\,\bar y_{ts} \tag{146}$$

Clearly, $\bar y_w$ is an unbiased estimate of the population mean
while its variance is given by

$$V(\bar y_w) = \sum_{t=1}^{k}\lambda_t^2\,V(\bar y_{ts})$$
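Substituting from (91), we have

$$V(\bar y_w) = \sum_{t=1}^{k}\lambda_t^2\left\{\left(\frac{1}{n_t} - \frac{1}{N_t}\right)S_{tb}'^2 + \frac{1}{n_t N_t}\sum_{i=1}^{N_t}\frac{M_{ti}^2}{\bar M_t^2}\left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right)S_{ti}^2\right\} \tag{147}$$

where

$$S_{tb}'^2 = \frac{1}{N_t - 1}\sum_{i=1}^{N_t}\left(\frac{M_{ti}\,\bar y_{ti.}}{\bar M_t} - \bar y_{t..}\right)^2
\quad\text{and}\quad
S_{ti}^2 = \frac{1}{M_{ti} - 1}\sum_{j=1}^{M_{ti}}(y_{tij} - \bar y_{ti.})^2 \tag{148}$$

and estimates of $S_{tb}'^2$ and $S_{ti}^2$ are provided by

$$\begin{aligned}
\text{Est. } S_{tb}'^2 &= s_{tb}'^2 - \frac{1}{n_t}\sum_{i=1}^{n_t}\frac{M_{ti}^2}{\bar M_t^2}\left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right)s_{ti}^2 \\
\text{Est. } S_{ti}^2 &= s_{ti}^2 = \frac{1}{m_{ti} - 1}\sum_{j=1}^{m_{ti}}(y_{tij} - \bar y_{ti(m_{ti})})^2
\end{aligned} \tag{149}$$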

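Formulae for the mean and its variance in stratified sampling
appropriate for the case of equal clusters follow as special cases.
Thus, for $M_{ti} = M_t$ and $m_{ti} = m_t$, we have

$$\bar y_w = \sum_{t=1}^{k}\lambda_t\,\bar y_{t(n_t m_t)}$$

where

$$\bar y_{t(n_t m_t)} = \frac{1}{n_t m_t}\sum_{i=1}^{n_t}\sum_{j=1}^{m_t} y_{tij}
\quad\text{and}\quad
\lambda_t = \frac{N_t M_t}{\sum_{t=1}^{k} N_t M_t}$$

and for its variance

$$V(\bar y_w) = \sum_{t=1}^{k}\lambda_t^2\left\{\left(\frac{1}{n_t} - \frac{1}{N_t}\right)S_{tb}^2 + \frac{1}{n_t}\left(\frac{1}{m_t} - \frac{1}{M_t}\right)S_{tw}^2\right\}$$

where

$$S_{tb}^2 = \frac{1}{N_t - 1}\sum_{i=1}^{N_t}(\bar y_{ti.} - \bar y_{t..})^2
\quad\text{and}\quad
S_{tw}^2 = \frac{1}{N_t(M_t - 1)}\sum_{i=1}^{N_t}\sum_{j=1}^{M_t}(y_{tij} - \bar y_{ti.})^2$$

and estimates of $S_{tb}^2$ and $S_{tw}^2$ are provided by the same formulae
as in (110) and (112), namely,

$$\begin{aligned}
\text{Est. } S_{tb}^2 &= s_{tb}^2 - \left(\frac{1}{m_t} - \frac{1}{M_t}\right)s_{tw}^2 \\
\text{Est. } S_{tw}^2 &= s_{tw}^2
\end{aligned} \tag{150}$$

so that

$$\text{Est. } V(\bar y_w) = \sum_{t=1}^{k}\lambda_t^2\left\{\left(\frac{1}{n_t} - \frac{1}{N_t}\right)s_{tb}^2 + \frac{1}{N_t}\left(\frac{1}{m_t} - \frac{1}{M_t}\right)s_{tw}^2\right\} \tag{151}$$
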
7.17 Efficiency of Stratification in Sub-Sampling 


We shall consider the simple case for which $M_{ti} = M$. Further,
we shall suppose that $m_{ti} = m$, so that the total number of
second-stage units to be included in the sample is a fixed number nm
whether drawn as a stratified or unstratified sample. In this section
we shall estimate from a given stratified sample the difference
between the variances of a stratified and an unstratified sample.


If the sample were selected as an unstratified two-stage sample,
the estimate of the population mean would be

$$\bar y_{nm} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij} \tag{152}$$

with the sampling variance given by

$$V(\bar y_{nm})_{US} = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)S_w^2 \tag{153}$$

where the letters "US" stand for "unstratified",

$S_b^2$ = the mean square between the first-stage unit means in the
population

$$= \frac{1}{N-1}\sum_{i=1}^{N}(\bar y_{i.} - \bar y_{..})^2 \tag{154}$$

$S_w^2$ = the mean square between second-stage units within
first-stage units in the whole population

$$= \frac{1}{N(M-1)}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij} - \bar y_{i.})^2 \tag{155}$$


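If the sample is a stratified sample, the estimate will be

$$\bar y_w = \sum_{t=1}^{k} p_t\,\bar y_{t(n_t m)}$$

where $p_t$ is the weight for the t-th stratum, and its sampling variance

$$V(\bar y_w)_S = \sum_{t=1}^{k} p_t^2\left\{\left(\frac{1}{n_t} - \frac{1}{N_t}\right)S_{tb}^2 + \frac{1}{n_t}\left(\frac{1}{m} - \frac{1}{M}\right)S_{tw}^2\right\} \tag{156}$$

The relative excess of (153) over (156) represents the gain in
precision due to stratification. For estimating this gain from the
selected stratified sample, we require an estimate of $S_b^2$. We have

$$\begin{aligned}
(N-1)S_b^2 &= \sum_{i=1}^{N}(\bar y_{i.} - \bar y_{..})^2 \\
&= \sum_{t=1}^{k}\sum_{i=1}^{N_t}(\bar y_{ti.} - \bar y_{...})^2 \\
&= \sum_{t=1}^{k}\sum_{i=1}^{N_t}(\bar y_{ti.} - \bar y_{t..} + \bar y_{t..} - \bar y_{...})^2 \\
&= \sum_{t=1}^{k}(N_t - 1)S_{tb}^2 + \sum_{t=1}^{k} N_t(\bar y_{t..} - \bar y_{...})^2 \\
&= \sum_{t=1}^{k}(N_t - 1)S_{tb}^2 + \sum_{t=1}^{k} N_t\,\bar y_{t..}^2 - N\bar y_{...}^2
\end{aligned} \tag{157}$$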


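The estimate of $S_{tb}^2$ is known from (150), so that our problem
reduces to estimating the second and third terms in (157). Now,
from (10), we have

$$V(\bar y_{t(n_t m)}) = E(\bar y_{t(n_t m)}^2) - \bar y_{t..}^2 = \left(\frac{1}{n_t} - \frac{1}{N_t}\right)S_{tb}^2 + \frac{1}{n_t}\left(\frac{1}{m} - \frac{1}{M}\right)S_{tw}^2 \tag{158}$$

whence

$$\text{Est. } \bar y_{t..}^2 = \bar y_{t(n_t m)}^2 - \left(\frac{1}{n_t} - \frac{1}{N_t}\right)s_{tb}^2 - \frac{1}{N_t}\left(\frac{1}{m} - \frac{1}{M}\right)s_{tw}^2 \tag{159}$$

or

$$\text{Est. } \sum_{t=1}^{k} N_t\,\bar y_{t..}^2 = \sum_{t=1}^{k} N_t\,\bar y_{t(n_t m)}^2 - \sum_{t=1}^{k} N_t\left(\frac{1}{n_t} - \frac{1}{N_t}\right)s_{tb}^2 - \left(\frac{1}{m} - \frac{1}{M}\right)\sum_{t=1}^{k} s_{tw}^2 \tag{160}$$

Also

$$V(\bar y_w) = E\left(\sum_{t=1}^{k} p_t\,\bar y_{t(n_t m)}\right)^2 - \bar y_{...}^2$$

so that

$$\text{Est. } (N\bar y_{...}^2) = N\left(\sum_{t=1}^{k} p_t\,\bar y_{t(n_t m)}\right)^2 - N\cdot\text{Est. } V(\bar y_w)$$

whence substituting from (156), we have

$$\text{Est. } (N\bar y_{...}^2) = N\left(\sum_{t=1}^{k} p_t\,\bar y_{t(n_t m)}\right)^2 - N\sum_{t=1}^{k} p_t^2\left\{\left(\frac{1}{n_t} - \frac{1}{N_t}\right)s_{tb}^2 + \frac{1}{N_t}\left(\frac{1}{m} - \frac{1}{M}\right)s_{tw}^2\right\} \tag{161}$$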
Subtracting (161) from (160), we get


-(1- a) EE- ғуы, 
m M \S п, п, te 
i 
1 1 E Мр? ү 
– да (ар x) – пе) би 
іші 


(162) 
whence on substituting in (157), we obtain 


k 
1 N, Npè E Np? ê 2 
ЗЕ NT a i SE Р Un N; Su 
Np? 
+ x m - x) |} Mir = w) Бы | 


(163) 


If the variation between second-stage units within first-stage units
can be assumed to be of the same order from stratum to stratum, we
can substitute for $S_{tw}^2$ its pooled estimate over all the strata,
given by


nt m 


A n к= 1) Ў д, 2,9 Jui — Ўн) | uen 


The difference between (153) and (156) when $S_b^2$ is estimated
from (163) and $S_{tb}^2$ from (150) represents the reduction in variance
due to stratification (Sukhatme, 1950).

The difference assumes a simple form when $p_t = N_t/N$. On
substituting from (163) in (153), we get


п М 
N k 
“т М Di Qa — ул) 
іші 
k 5: 
ry һа = 
t=1 
1 ү М-п 
etus [ а NSL 


хва -» | Б (165) 


Also, from (156), we have


| 
rea = Yo (d - ве + (а-ы 


Непсе (166) 


x 

М 1 1 27% 

Est. У див = G ше т | PS.” 
mm 


М-п + |" ла 
n(N —1) Ly Ou — Јо 


іші 


Est Wos = F) = 


k 


k 
тту e N—n 
m M n n (N—1) 


(167) 
Substituting for Est. $S_{tb}^2$ from (150) the value

$$s_{tb}^2 - \left(\frac{1}{m} - \frac{1}{M}\right)s_{tw}^2$$

we reach the simple expression


" 

M a. 

Est. {Vus — Vs} = AND > Pt (ба — Fo)? 
ізі 


Pi 5 1.1.80) ° 
F 2, {2 п, N—1 n Ñ 5n 
(168) 


Which is seen to be identical with equation (73) of Chapter III. 
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CHAPTER УШ 


SUB-SAMPLING (Continued) 


8.1 Introduction 


In the preceding chapter we have developed the sampling 
theory appropriate for sub-sampling systems involving the use of 
equal probabilities of selection at each stage of sampling. When 
the first-stage units are large and vary considerably in their sizes, 
this system of sub-sampling is not usually efficient. This is even 
more so in cases where practical considerations demand that the 
survey should be confined to only a small number of first-stage 
units within each stratum with equal number of second-stage 
units from each first-stage unit, although the amount of sub- 
sampling from the selected first-stage units would be necessarily 
unequal under optimum allocation. A system of sub-sampling 
involving the use of varying probabilities has been used with 
considerable gains in efficiency in such cases. In particular, a 
sub-sampling design in which only one first-stage unit is selected 
from each stratum, with probability proportional to the measure 
of the size of the unit, and a fixed number of second-stage units 
is selected with equal probabilities from each of the selected 
first-stage units, has been found to bring about marked 
improvements in precision, compared with sub-sampling systems 
involving the use of equal probabilities. The developments are 
due to Hansen and Hurwitz (1943, 1949). In this chapter we shall 
give the theory of sub-sampling systems involving the use of 
varying probabilities. 


8.2 Estimate of the Population Mean and its Variance 

We shall assume that the first-stage units are selected with 
replacement. Further, we shall suppose that whenever a specified 
first-stage unit of the population, say the i-th, is included in the 
sample, a sub-sample of $m_i$ second-stage units will be drawn 
therefrom without replacement, but only after the replacement of 
any sub-samples that have been drawn previously. In other words, 
if the i-th unit happens to be selected, say $\gamma$ times, in a 
sample of n first-stage units, $\gamma$ sub-samples of $m_i$ units 
each will be drawn therefrom independently of each other, each 
sub-sample of $m_i$ being drawn without replacement. 


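Let $P_i$ denote the selection probability assigned to the i-th 
first-stage unit of the population (i = 1, 2, ..., N) and 
$\sum_{i=1}^{N} P_i = 1$. 

Further, let 

$$z_{ij} = \frac{M_i}{M_0 P_i}\, y_{ij} \qquad (1)$$

Consider the estimate 

$$\bar{z}_s = \frac{1}{n} \sum_{i}^{n} \bar{z}_{i(m_i)} = \frac{1}{n} \sum_{i}^{n} \frac{1}{m_i} \sum_{j}^{m_i} z_{ij} \qquad (2)$$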


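where the summation is taken over all the units in the sample. 
Then it is easily shown that $\bar{z}_s$ is an unbiased estimate of 
$\bar{y}_{..}$. 

For, we have 

$$E(\bar{z}_s) = E\Big[\frac{1}{n} \sum_{i}^{n} E(\bar{z}_{i(m_i)} \mid i)\Big] \qquad (3)$$

$$= E\Big[\frac{1}{n} \sum_{i}^{n} \bar{z}_{i.}\Big] = \bar{z}_{..} \qquad (4)$$

where 

$$\bar{z}_{..} = \sum_{i=1}^{N} P_i \bar{z}_{i.} = \frac{1}{M_0} \sum_{i=1}^{N} M_i \bar{y}_{i.} = \bar{y}_{..} \qquad (5)$$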
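To evaluate the sampling variance of $\bar{z}_s$, we write 

$$V(\bar{z}_s) = E\{\bar{z}_s - E(\bar{z}_s)\}^2 = E(\bar{z}_s - \bar{z}_{..})^2 = E(\bar{z}_s - \bar{z}_{n.} + \bar{z}_{n.} - \bar{z}_{..})^2$$

$$= E(\bar{z}_s - \bar{z}_{n.})^2 + E(\bar{z}_{n.} - \bar{z}_{..})^2 + 2E(\bar{z}_s - \bar{z}_{n.})(\bar{z}_{n.} - \bar{z}_{..}) \qquad (6)$$

where $\bar{z}_{n.}$ is the mean of the true first-stage unit means 
$\bar{z}_{i.}$ over the units in the sample. 

Taking the first term in (6), we have 

$$E(\bar{z}_s - \bar{z}_{n.})^2 = E\Big\{\frac{1}{n} \sum_{i}^{n} (\bar{z}_{i(m_i)} - \bar{z}_{i.})\Big\}^2$$

$$= \frac{1}{n^2} E\Big[\sum_{i}^{n} (\bar{z}_{i(m_i)} - \bar{z}_{i.})^2 + \sum_{i \neq i'}^{n} (\bar{z}_{i(m_i)} - \bar{z}_{i.})(\bar{z}_{i'(m_{i'})} - \bar{z}_{i'.})\Big] \qquad (7)$$

Since the first-stage units are sub-sampled independently of each 
other, we may write 

$$E(\bar{z}_s - \bar{z}_{n.})^2 = \frac{1}{n^2} E\Big[\sum_{i}^{n} E\{(\bar{z}_{i(m_i)} - \bar{z}_{i.})^2 \mid i\}\Big] + \frac{1}{n^2} E\Big[\sum_{i \neq i'}^{n} E\{(\bar{z}_{i(m_i)} - \bar{z}_{i.}) \mid i\}\, E\{(\bar{z}_{i'(m_{i'})} - \bar{z}_{i'.}) \mid i'\}\Big]$$

$$= \frac{1}{n^2} E\Big[\sum_{i}^{n} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{iz}^2\Big] + 0$$

$$= \frac{1}{n} \sum_{i=1}^{N} P_i \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (z_{ij} - \bar{z}_{i.})^2$$

$$= \frac{1}{n} \sum_{i=1}^{N} \frac{M_i^2}{M_0^2 P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_i^2 \qquad (8)$$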
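The value of the second term in (6) is given by (77) of Chapter 
VI. We have 

$$E(\bar{z}_{n.} - \bar{z}_{..})^2 = \frac{\sigma_{bz}^2}{n} \qquad (9)$$

where 

$$\sigma_{bz}^2 = \sum_{i=1}^{N} P_i (\bar{z}_{i.} - \bar{z}_{..})^2 = \sum_{i=1}^{N} P_i \Big(\frac{M_i \bar{y}_{i.}}{M_0 P_i} - \bar{y}_{..}\Big)^2 \qquad (10)$$

The expected value of the third term in (6) is obviously zero. 
We therefore obtain 

$$V(\bar{z}_s) = \frac{\sigma_{bz}^2}{n} + \frac{1}{n} \sum_{i=1}^{N} P_i \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{iz}^2 \qquad (11)$$

which can also be alternatively written as 

$$V(\bar{z}_s) = \frac{1}{n} \sum_{i=1}^{N} P_i \Big(\frac{M_i \bar{y}_{i.}}{M_0 P_i} - \bar{y}_{..}\Big)^2 + \frac{1}{n} \sum_{i=1}^{N} \frac{M_i^2}{M_0^2 P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_i^2 \qquad (12)$$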
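When the selection probabilities are such that 

$$P_i = \frac{M_i}{M_0} \qquad (i = 1, 2, \ldots, N)$$

the expression for $\sigma_{bz}^2$ is simplified, being given by 

$$\sigma_{bz}^2 = \sum_{i=1}^{N} \frac{M_i}{M_0} (\bar{y}_{i.} - \bar{y}_{..})^2 \qquad (13)$$

whence 

$$V(\bar{z}_s) = \frac{1}{n} \sum_{i=1}^{N} \frac{M_i}{M_0} (\bar{y}_{i.} - \bar{y}_{..})^2 + \frac{1}{n} \sum_{i=1}^{N} \frac{M_i}{M_0} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_i^2 \qquad (14)$$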


Whenever the mean square and the variance relate to the 
variate z, we have so indicated by adding z as a subscript to 
$S$ and $\sigma$. In all other cases, $S$ and $\sigma$ should be taken 
to relate to the variate y only. 
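
The design and the variance formula of this section can be illustrated numerically. The following is only a sketch under stated assumptions: the population `units`, the probabilities and the function names are hypothetical, and the variance is computed in the form of equation (12), with $S_i^2$ taken as the mean square with divisor $M_i - 1$.

```python
import random
from statistics import mean, variance

def pps_two_stage_estimate(units, probs, n, m, rng):
    """One realisation of the estimate (2): first-stage units drawn with
    replacement with probabilities probs, then m[i] second-stage units
    drawn without replacement each time unit i is selected."""
    M0 = sum(len(u) for u in units)          # total number of second-stage units
    total = 0.0
    for _ in range(n):
        i = rng.choices(range(len(units)), weights=probs)[0]
        sub = rng.sample(units[i], m[i])     # a fresh sub-sample for every draw
        total += len(units[i]) * mean(sub) / (M0 * probs[i])
    return total / n

def variance_formula(units, probs, n, m):
    """Sampling variance of the estimate, in the form of equation (12)."""
    M0 = sum(len(u) for u in units)
    ybar = sum(sum(u) for u in units) / M0   # population mean per second-stage unit
    between = sum(P * (len(u) * mean(u) / (M0 * P) - ybar) ** 2
                  for u, P in zip(units, probs))
    within = sum(len(u) ** 2 / (M0 ** 2 * P) * (1 / mi - 1 / len(u)) * variance(u)
                 for u, P, mi in zip(units, probs, m))
    return (between + within) / n

units = [[1, 2, 3], [4, 6]]                  # hypothetical population of two units
print(variance_formula(units, [0.6, 0.4], n=2, m=[1, 1]))
print(pps_two_stage_estimate(units, [0.6, 0.4], 2, [1, 1], random.Random(0)))
```

If the probabilities are made proportional to the unit totals and every selected unit is completely enumerated, both components of the variance vanish, which is a convenient check on the formula.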


8.3 Estimation of the Variance from the Sample 


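Consider the mean square of $\bar{z}_{i(m_i)}$ obtained from the sample 

$$s_{bz}^2 = \frac{1}{n-1} \sum_{i}^{n} (\bar{z}_{i(m_i)} - \bar{z}_s)^2 \qquad (15)$$

Expanding and taking expectations, we get 

$$E(s_{bz}^2) = \frac{1}{n-1} E\Big\{\sum_{i}^{n} \bar{z}_{i(m_i)}^2 - n\bar{z}_s^2\Big\} = \frac{1}{n-1} \Big\{\sum_{i}^{n} E(\bar{z}_{i(m_i)}^2) - nE(\bar{z}_s^2)\Big\} \qquad (16)$$

Since the first-stage units are selected with replacement, we may 
consider our sample of n to be the result of n independent 
samples of one each, and write the right-hand side in (16) as 

$$\frac{1}{n-1} \Big[n\{V(\bar{z}_{i(m_i)}) + \bar{z}_{..}^2\} - n\{V(\bar{z}_s) + \bar{z}_{..}^2\}\Big] \qquad (17)$$

where $V(\bar{z}_{i(m_i)})$ is the variance of the mean based on a sample 
of one first-stage unit. 

Substituting from (11) in (17), we get 

$$E(s_{bz}^2) = \sigma_{bz}^2 + \sum_{i=1}^{N} P_i \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{iz}^2 \qquad (18)$$

It follows that 

$$\text{Est. } V(\bar{z}_s) = \frac{s_{bz}^2}{n} \qquad (19)$$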
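
The estimate and the estimate of its variance can be computed directly from the n values of the sub-sample means on the z scale. The sketch below assumes that these values are already available in a list; the function name and the numbers are illustrative only.

```python
def estimate_with_variance(z_means):
    """Return the estimate (mean of the z-scale sub-sample means) and the
    estimate of its variance, s_bz^2 / n, from n first-stage draws."""
    n = len(z_means)
    z_s = sum(z_means) / n
    # Mean square between the first-stage draws, divisor n - 1.
    s_bz2 = sum((z - z_s) ** 2 for z in z_means) / (n - 1)
    return z_s, s_bz2 / n

# Hypothetical z-scale means from n = 3 draws.
z_s, var_est = estimate_with_variance([3.0, 5.0, 4.0])
print(z_s, var_est)
```

The simplicity of this estimate is a consequence of selecting the first-stage units with replacement: no separate estimate of the within-unit component is required.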


8.4 Allocation of Sample 


So far we have placed no restriction on the size of the sub- 
sample to be drawn from a selected first-stage unit. It may be 
related to the selected unit or independent of it. The guiding 
principle in determining the optimum values of $n$ and $m_i$ is clearly 
to maximize the precision for a given cost, or to minimize the 
cost for a desired precision. 


In a large number of surveys, the cost will be determined by 
the number of different first-stage units and the total number of 
second-stage units in the sample. We shall suppose here that 
the cost of the survey is made up of two components, as follows: 

$$C = c_1 n' + c_2 \sum_{i}^{n} m_i \qquad (20)$$

where 
$n'$ denotes the number of different first-stage units included 
in the sample, 
$\sum m_i$ the total number of second-stage units in the sample, 
$c_1$ the cost per first-stage unit on travel and setting up of 
an office, 
and 
$c_2$ the cost per second-stage unit of collecting the required 
information. 


Clearly, C will vary from sample to sample of n. In actual practice, 
one must be able to predict the cost in advance of designing the 
sample in order to be able to compute the optima. We shall, 
therefore, consider the average instead of the actual cost of 
surveying a sample. Now to obtain the average value of the 
first term in (20), we have 


$$c_1 E(n') = c_1 \sum_{i=1}^{N} 1 \cdot \{\text{Probability that the i-th unit is included at least once in a sample of } n\}$$

$$= c_1 \sum_{i=1}^{N} \{1 - \text{Probability that the i-th unit is not included in any of the } n \text{ draws}\}$$

$$= c_1 \sum_{i=1}^{N} \{1 - (1 - P_i)^n\} \qquad (21)$$

The average value of the second component in (20) is given by 

$$c_2 E\Big(\sum_{i}^{n} m_i\Big) = c_2 n \sum_{i=1}^{N} P_i m_i \qquad (22)$$

Hence the average total cost of the survey will be represented 
by 

$$C = c_1 \sum_{i=1}^{N} \{1 - (1 - P_i)^n\} + c_2 n \sum_{i=1}^{N} P_i m_i \qquad (23)$$

This cost function, however, offers a slight disadvantage in that 
it is not simple to deal with. We shall, therefore, suppose that 
N is reasonably large and none of the $P_i$'s too large, so that the 
average cost may be approximated by the simple function 

$$C = c_1 n + c_2 n \sum_{i=1}^{N} P_i m_i \qquad (24)$$
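
The quality of the approximation of (23) by (24) is easily examined numerically. The sketch below computes the expected number of different first-stage units and the two average cost functions; the function names, probabilities and costs are hypothetical.

```python
def expected_distinct_units(probs, n):
    """Expected number of different first-stage units in n draws made
    with replacement, as in equation (21) without the factor c1."""
    return sum(1 - (1 - p) ** n for p in probs)

def average_cost_exact(probs, m, n, c1, c2):
    """Average cost in the form of equation (23)."""
    return c1 * expected_distinct_units(probs, n) + c2 * n * sum(
        p * mi for p, mi in zip(probs, m))

def average_cost_approx(probs, m, n, c1, c2):
    """Approximate average cost in the form of equation (24)."""
    return c1 * n + c2 * n * sum(p * mi for p, mi in zip(probs, m))

# Hypothetical case: 100 equally likely units, 10 draws, 5 second-stage units each.
probs, m = [0.01] * 100, [5] * 100
print(expected_distinct_units(probs, 10))
print(average_cost_exact(probs, m, 10, c1=20.0, c2=1.0))
print(average_cost_approx(probs, m, 10, c1=20.0, c2=1.0))
```

The approximate cost is never smaller than the exact average, since the expected number of different units cannot exceed n; the two agree closely when no $P_i$ is large.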


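To determine the optima, the simplest method would be to 
consider the product of (11) with (24), and write 

$$V(\bar{z}_s) \cdot C = \Big\{\sigma_{bz}^2 + \sum_{i=1}^{N} P_i \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{iz}^2\Big\} \Big\{c_1 + c_2 \sum_{i=1}^{N} P_i m_i\Big\} \qquad (25)$$

Clearly, the minimum value of (25) will produce optima when 
either C or V is fixed in advance and V or C is minimized. 

Let 

$$\Delta = \sigma_{bz}^2 - \sum_{i=1}^{N} \frac{P_i}{M_i} S_{iz}^2 \qquad (26)$$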
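Equation (25) can then be rewritten and expanded as 

$$V(\bar{z}_s) \cdot C = \Big\{\Delta + \sum_{i=1}^{N} P_i \frac{S_{iz}^2}{m_i}\Big\} \Big\{c_1 + c_2 \sum_{i=1}^{N} P_i m_i\Big\}$$

$$= c_1 \Delta + c_2 \Delta \sum_{i=1}^{N} P_i m_i + c_1 \sum_{i=1}^{N} P_i \frac{S_{iz}^2}{m_i} + c_2 \sum_{i=1}^{N} P_i^2 S_{iz}^2 + c_2 \sum_{i > i' = 1}^{N} P_i P_{i'} \Big(\frac{S_{iz}^2}{m_i} m_{i'} + \frac{S_{i'z}^2}{m_{i'}} m_i\Big) \qquad (27)$$

Assuming $\Delta$ to be positive, (27) can be put in the form 

$$V(\bar{z}_s) \cdot C = c_1 \Delta + c_2 \sum_{i=1}^{N} P_i^2 S_{iz}^2 + 2 \sum_{i=1}^{N} P_i \sqrt{c_1 c_2 \Delta}\, S_{iz} + 2 c_2 \sum_{i > i' = 1}^{N} P_i P_{i'} S_{iz} S_{i'z}$$

$$+ \sum_{i=1}^{N} P_i \Big(\sqrt{c_2 \Delta m_i} - \sqrt{\frac{c_1}{m_i}}\, S_{iz}\Big)^2 + c_2 \sum_{i > i' = 1}^{N} P_i P_{i'} \Big(S_{iz} \sqrt{\frac{m_{i'}}{m_i}} - S_{i'z} \sqrt{\frac{m_i}{m_{i'}}}\Big)^2 \qquad (28)$$

This is minimum when each of the two square terms in (28) is 
zero, giving us for the optimum of $m_i$ the value 

$$m_i = \sqrt{\frac{c_1}{c_2 \Delta}}\, S_{iz} = \sqrt{\frac{c_1}{c_2 \Delta}}\, \frac{M_i S_i}{M_0 P_i} \qquad (29)$$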
In other words, $m_i$ should be so determined that 

$$\frac{P_i m_i}{M_i S_i} = \text{constant} = \frac{1}{M_0} \sqrt{\frac{c_1}{c_2 \Delta}} \qquad (i = 1, 2, \ldots, N) \qquad (30)$$

However, $S_i$ will ordinarily not be known. It will also vary 
from character to character. Practical considerations require that 
$m_i$ should be independent of $S_i$ even if it means somewhat 
departing from the optimum. In practice, $S_i$ will be generally 
found to increase with $M_i$, though seldom as fast as $M_i$. We 
shall assume here that $S_i$ is a constant equal to $S_w$, say, in 
which case $m_i$ would be so determined that 

$$\frac{P_i m_i}{M_i} = k = \text{a constant} \qquad (i = 1, 2, \ldots, N) \qquad (31)$$


Knowing k, the value of n is obtained from (11) or (24), depending 
upon whether the cost of the survey is minimized for fixed 
$V_0$ or the variance is minimized for fixed $C_0$. In the former case, 

$$n = \frac{1}{V_0} \Big\{\sigma_{bz}^2 + \sum_{i=1}^{N} \frac{M_i^2}{M_0^2 P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_w^2\Big\} \qquad (32)$$

and in the latter, 

$$n = \frac{C_0}{c_1 + c_2 k M_0} \qquad (33)$$

We remark that when $P_i$ is proportional to $M_i$, the optimum 
value of $m_i$ is a constant, irrespective of the first-stage unit 
included in the sample. 
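
The allocation rule can be sketched numerically as follows. This is an illustration only: it takes the constant k as given, sets $m_i$ so that $P_i m_i / M_i = k$, and obtains n for a fixed budget from the approximate cost function (24), which under this rule becomes $n(c_1 + c_2 k M_0)$. The function name and all numerical values are hypothetical.

```python
def allocate(M, P, k, c1, c2, budget):
    """Sub-sample sizes m_i = k * M_i / P_i and the number of first-stage
    draws n affordable for a fixed budget under the cost function (24)."""
    M0 = sum(M)
    m = [k * Mi / Pi for Mi, Pi in zip(M, P)]
    # Under P_i * m_i = k * M_i the cost per first-stage draw is c1 + c2 * k * M0.
    n = budget / (c1 + c2 * k * M0)
    return m, n

# Hypothetical example with probabilities proportional to size.
M = [100, 300]
P = [Mi / sum(M) for Mi in M]
m, n = allocate(M, P, k=0.05, c1=30.0, c2=1.0, budget=1000.0)
print(m, n)
```

With probabilities proportional to size every $m_i$ comes out the same, in agreement with the remark above; in practice the values would be rounded to integers.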


8.5* Determination of Optimum Probabilities 


In determining the optimum allocation of the sample in the 
previous section we assumed that the selection probabilities $P_i$ 
were given. These probabilities can be any arbitrary positive 
proper fractions, subject to the condition that their sum is 1; 
alternatively, they can be related in a known way to the 
characteristics of the units to be selected. 


The optimum values of selection probabilities are given by 
minimizing the variance of $\bar{z}_s$ for given cost. In this section we 
shall determine them assuming that: (a) the sub-sampling rate 
$m_i/M_i$ for a specified first-stage unit i will be such that equation 
(31) is satisfied, and (b) the cost function is independent of $P_i$. 


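Following the Lagrange procedure, we consider the function $\phi$ 
given by 

$$\phi = V(\bar{z}_s) + \lambda \Big(\sum_{i=1}^{N} P_i - 1\Big) \qquad (34)$$

where $\lambda$ is a constant multiplier. Now the value of $V(\bar{z}_s)$ in 
terms of $P_i$'s is obtained from (12) after substituting for $m_i$ 
from (31), and is given by 

$$V(\bar{z}_s) = \frac{1}{n} \Big\{\sum_{i=1}^{N} \frac{M_i^2 \bar{y}_{i.}^2}{M_0^2 P_i} - \bar{y}_{..}^2\Big\} + \frac{1}{nkM_0} S_w^2 - \frac{1}{n} \sum_{i=1}^{N} \frac{M_i}{M_0^2 P_i} S_w^2 \qquad (35)$$

Substituting from (35) in (34), we write 

$$\phi = \frac{1}{n} \Big\{\sum_{i=1}^{N} \frac{M_i^2 \bar{y}_{i.}^2}{M_0^2 P_i} - \bar{y}_{..}^2\Big\} + \frac{1}{nkM_0} S_w^2 - \frac{1}{n} \sum_{i=1}^{N} \frac{M_i}{M_0^2 P_i} S_w^2 + \lambda \Big(\sum_{i=1}^{N} P_i - 1\Big)$$

Differentiating $\phi$ with respect to $P_i$ and equating to zero gives 

$$\frac{\partial \phi}{\partial P_i} = -\frac{1}{nM_0^2} \Big(\frac{M_i^2 \bar{y}_{i.}^2}{P_i^2} - \frac{M_i S_w^2}{P_i^2}\Big) + \lambda = 0$$

Hence 

$$P_i = \frac{M_i \bar{y}_{i.}}{M_0 \sqrt{n\lambda}} \sqrt{1 - \frac{S_w^2}{M_i \bar{y}_{i.}^2}}$$

But $\sum_{i=1}^{N} P_i = 1$. We, therefore, have 

$$P_i = \frac{M_i \bar{y}_{i.} \sqrt{1 - S_w^2 / (M_i \bar{y}_{i.}^2)}}{\sum_{i=1}^{N} M_i \bar{y}_{i.} \sqrt{1 - S_w^2 / (M_i \bar{y}_{i.}^2)}} \qquad (i = 1, 2, \ldots, N) \qquad (36)$$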


It will be noticed that the optimum value of $P_i$ depends on 
two factors: (i) the total of the character for the i-th first-stage 
unit, and (ii) the coefficient of variation of the first-stage unit 
mean. The latter will usually be a very small fraction so that 
$P_i$ will be primarily determined by the first-stage unit total 
$M_i \bar{y}_{i.}$. In practice, however, $M_i \bar{y}_{i.}$ will not be known 
although quantities correlated with them, as for example those 
determined from the previous census, may be available. Failing 
to have these, it would appear that the choice of $P_i$ proportional 
to $M_i$ would give about the optimum probabilities. 
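
A small numerical sketch may help to show how little the second factor matters. It assumes that minimizing (12) under (31), with $S_i$ equal to a constant $S_w$, gives probabilities proportional to $\sqrt{M_i^2 \bar{y}_{i.}^2 - M_i S_w^2}$; the function name and the data are hypothetical.

```python
from math import sqrt

def optimum_probabilities(M, ybar, Sw2):
    """Probabilities proportional to sqrt(M_i^2 * ybar_i^2 - M_i * Sw2).
    Raises ValueError if the quantity under the root is negative, that is,
    when the within-unit mean square is too large for the formula to apply."""
    w = [sqrt(Mi ** 2 * yi ** 2 - Mi * Sw2) for Mi, yi in zip(M, ybar)]
    total = sum(w)
    return [x / total for x in w]

M = [100, 400]          # hypothetical first-stage unit sizes
ybar = [2.0, 2.0]       # hypothetical first-stage unit means
print(optimum_probabilities(M, ybar, Sw2=0.0))   # proportional to the unit totals
print(optimum_probabilities(M, ybar, Sw2=4.0))   # only slightly different
```

When the within-unit mean square is zero the probabilities are exactly proportional to the unit totals; a positive value shifts a little weight towards the larger units.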


It should be pointed out that the above solution for optimum 
probabilities will hold even when the cost function is represented 
by (24); for, under the assumption that $P_i m_i / M_i = k$, (24) is 
seen to be independent of $P_i$'s. However, in a number of 
situations the survey cost may not be independent of $P_i$'s. 
Consider, for example, a situation where a list of first-stage units 
is available but that for the second-stage units within first-stage 
units is not. Listing of the second-stage units within the selected 
first-stage units is, therefore, an essential part of the survey 
work. The situation is of common occurrence in agricultural 
surveys in under-developed countries. Thus, lists of villages are 
readily available in most countries and the identification of 
boundaries of selected villages also does not present any 
difficulty. However, lists of second-stage units like fields or families 
are not available and have to be prepared for selecting a 
sub-sample. To the cost of survey represented by (20), we, therefore, 
have to add a component for expenditure for listing. This will 
generally vary with the size of the first-stage units. We, therefore, 
have a cost function given by 

$$C = c_1 n' + c_2 \sum_{i}^{n} m_i + c_3 \sum_{i}^{n} M_i \qquad (37)$$
where $c_3$ represents the cost of listing a second-stage unit in 
a selected first-stage unit. The average value of C in repeated 
samples will be given by 

$$c_1 n + c_2 n \sum_{i=1}^{N} P_i m_i + c_3 n \sum_{i=1}^{N} P_i M_i \qquad (38)$$

and is seen to reduce to 

$$c_1 n + c_2 n k M_0 + c_3 n \sum_{i=1}^{N} P_i M_i \qquad (39)$$

for $P_i m_i = k M_i$. The cost function (39) will now be seen to 
depend upon $P_i$'s. However, if $M_i$'s are unknown, $M_0$ will also 
not be known and the estimate $\bar{z}_s$ can no longer be used. Several 
alternative estimates can be formed. We shall consider one 
such estimate in this chapter, namely, the ratio estimate, and 
thereafter resume discussion of the problem considered in this 
section. 


8.6* Ratio Estimate 

Let 

$$\bar{y}_R = \frac{\bar{z}_s}{\bar{v}_s}\, \bar{x}_{..} \qquad (40)$$

denote the ratio estimate of the population mean, where 

$$z_{ij} = \frac{M_i}{M_0 P_i}\, y_{ij}, \qquad \bar{z}_s = \frac{1}{n} \sum_{i}^{n} \frac{M_i}{M_0 P_i}\, \bar{y}_{i(m_i)}$$

$$v_{ij} = \frac{M_i}{M_0 P_i}\, x_{ij}, \qquad \bar{v}_s = \frac{1}{n} \sum_{i}^{n} \frac{M_i}{M_0 P_i}\, \bar{x}_{i(m_i)} \qquad (41)$$

x standing for a supplementary variate observed for all units 
in the sample. 


Then from the results of Chapter IV we may, for n sufficiently 
large, ignore the bias terms in the expected value of $\bar{y}_R$ and 
regard it as an unbiased estimate of the population mean for 
all practical purposes. 
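
A short sketch of the computation is given below. It assumes the ratio estimate is the ratio of the two weighted sample means multiplied by the population mean of x, so that the common factors $1/n$ and $1/M_0$ cancel and $M_0$ need not be known. The function name and the data are hypothetical.

```python
def ratio_estimate(M, P, ybar_s, xbar_s, x_pop_mean):
    """Ratio estimate of the population mean from the sampled first-stage
    units: M and P are their sizes and selection probabilities, ybar_s and
    xbar_s their sub-sample means of y and x."""
    u = [Mi / Pi for Mi, Pi in zip(M, P)]           # weights M_i / P_i
    z = sum(ui * y for ui, y in zip(u, ybar_s))     # proportional to the z mean
    v = sum(ui * x for ui, x in zip(u, xbar_s))     # proportional to the v mean
    return z / v * x_pop_mean

# Hypothetical sample of two units; x is identically 1.
print(ratio_estimate([10, 20], [0.5, 0.5], [2.0, 4.0], [1.0, 1.0], 1.0))
```

With x identically equal to 1 the estimate reduces to a weighted mean of the sub-sample means with weights $M_i/P_i$.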

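To obtain the variance of the estimate $\bar{y}_R$, we shall start with 
the expression (28) in Chapter IV. Since $E(\bar{z}_s) = \bar{y}_{..}$ and 
$E(\bar{v}_s) = \bar{x}_{..}$, we write to a first approximation 

$$\frac{V(\bar{y}_R)}{\bar{y}_{..}^2} = \frac{V(\bar{z}_s)}{\bar{y}_{..}^2} + \frac{V(\bar{v}_s)}{\bar{x}_{..}^2} - \frac{2\, \mathrm{Cov}(\bar{z}_s, \bar{v}_s)}{\bar{y}_{..} \bar{x}_{..}} \qquad (42)$$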


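Now, from (12), we have 

$$V(\bar{z}_s) = \frac{1}{n} \Big(\sum_{i=1}^{N} \frac{M_i^2 \bar{y}_{i.}^2}{M_0^2 P_i} - \bar{y}_{..}^2\Big) + \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{iy}^2 \qquad (43)$$

and by analogy 

$$V(\bar{v}_s) = \frac{1}{n} \Big(\sum_{i=1}^{N} \frac{M_i^2 \bar{x}_{i.}^2}{M_0^2 P_i} - \bar{x}_{..}^2\Big) + \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{ix}^2 \qquad (44)$$

Further, 

$$\mathrm{Cov}(\bar{z}_s, \bar{v}_s) = E\{(\bar{z}_s - \bar{z}_{..})(\bar{v}_s - \bar{v}_{..})\}$$

$$= E\{(\bar{z}_s - \bar{z}_{n.} + \bar{z}_{n.} - \bar{z}_{..})(\bar{v}_s - \bar{v}_{n.} + \bar{v}_{n.} - \bar{v}_{..})\}$$

$$= E\{(\bar{z}_s - \bar{z}_{n.})(\bar{v}_s - \bar{v}_{n.}) + (\bar{z}_{n.} - \bar{z}_{..})(\bar{v}_{n.} - \bar{v}_{..})\} \qquad (45)$$

since the expectations of the other two product terms are zero. 
Now taking the first term in (45), we have 

$$E\{(\bar{z}_s - \bar{z}_{n.})(\bar{v}_s - \bar{v}_{n.})\} = E\Big[\Big\{\frac{1}{n} \sum_{i}^{n} \frac{M_i}{M_0 P_i} (\bar{y}_{i(m_i)} - \bar{y}_{i.})\Big\} \Big\{\frac{1}{n} \sum_{i}^{n} \frac{M_i}{M_0 P_i} (\bar{x}_{i(m_i)} - \bar{x}_{i.})\Big\}\Big]$$

$$= \frac{1}{n^2 M_0^2} E\Big[\sum_{i}^{n} \frac{M_i^2}{P_i^2} (\bar{y}_{i(m_i)} - \bar{y}_{i.})(\bar{x}_{i(m_i)} - \bar{x}_{i.}) + \sum_{i \neq i'}^{n} \frac{M_i M_{i'}}{P_i P_{i'}} (\bar{y}_{i(m_i)} - \bar{y}_{i.})(\bar{x}_{i'(m_{i'})} - \bar{x}_{i'.})\Big]$$

$$= \frac{1}{n^2 M_0^2} E\Big[\sum_{i}^{n} \frac{M_i^2}{P_i^2} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{ixy}\Big]$$

$$= \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{ixy} \qquad (46)$$

The second term in (45) is clearly given by 

$$E\{(\bar{z}_{n.} - \bar{z}_{..})(\bar{v}_{n.} - \bar{v}_{..})\} = \frac{1}{n} \sum_{i=1}^{N} P_i \Big(\frac{M_i \bar{y}_{i.}}{M_0 P_i} - \bar{y}_{..}\Big) \Big(\frac{M_i \bar{x}_{i.}}{M_0 P_i} - \bar{x}_{..}\Big)$$

$$= \frac{1}{n} \Big(\sum_{i=1}^{N} \frac{M_i^2 \bar{y}_{i.} \bar{x}_{i.}}{M_0^2 P_i} - \bar{y}_{..} \bar{x}_{..}\Big) \qquad (47)$$

Substituting from (46) and (47) in (45), we have 

$$\mathrm{Cov}(\bar{z}_s, \bar{v}_s) = \frac{1}{n} \Big(\sum_{i=1}^{N} \frac{M_i^2 \bar{y}_{i.} \bar{x}_{i.}}{M_0^2 P_i} - \bar{y}_{..} \bar{x}_{..}\Big) + \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_{ixy} \qquad (48)$$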
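Hence, from (42), we obtain 

$$V(\bar{y}_R) = \frac{1}{n} \sum_{i=1}^{N} \frac{M_i^2}{M_0^2 P_i} (\bar{y}_{i.} - R\bar{x}_{i.})^2 + \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) D_i^2 \qquad (49)$$

where 

$$D_i^2 = S_{iy}^2 + R^2 S_{ix}^2 - 2R S_{ixy} \qquad (50)$$

and 

$$R = \frac{\bar{y}_{..}}{\bar{x}_{..}}$$

When $x_{ij} = 1$, the expression for the estimate $\bar{y}_R$ is simplified, 
being given by 

$$\bar{y}_R = \frac{\bar{z}_s}{\bar{u}_s} \qquad (51)$$

where 

$$u_i = \frac{M_i}{M_0 P_i} \quad \text{and} \quad \bar{u}_s = \frac{1}{n} \sum_{i}^{n} u_i \qquad (52)$$

Also, 

$$S_{ix} = 0$$

and the variance of $\bar{y}_R$ takes the simple form given by 

$$V(\bar{y}_R) = \frac{1}{n} \sum_{i=1}^{N} \frac{M_i^2}{M_0^2 P_i} (\bar{y}_{i.} - \bar{y}_{..})^2 + \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big(\frac{1}{m_i} - \frac{1}{M_i}\Big) S_i^2 \qquad (53)$$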
An estimate of $V(\bar{y}_R)$ is easily obtained. The reader may verify 
that, following the steps shown in Section 7.13, the estimate is 
given by 

$$\text{Est. } V(\bar{y}_R) = \frac{1}{n(n-1)} \sum_{i}^{n} \frac{M_i^2}{M_0^2 P_i^2} (\bar{y}_{i(m_i)} - \bar{y}_R)^2 \qquad (54)$$
8.7* Allocation of Sample and Determination of Optimum 
Probabilities: General Case 


We shall now return to the problems discussed in Sections 
8.4 and 8.5, namely, to determine the optimum values of $m_i$'s 
and $P_i$'s when the cost function is represented by (39) and the 
estimate used is the ratio estimate $\bar{y}_R$. The solution is 
straightforward since we note that: (a) $V(\bar{y}_R)$ has the same form as 
$V(\bar{z}_s)$ except that instead of $\sigma_{bz}^2$, we now have 

$$\sum_{i=1}^{N} P_i (\bar{z}_{i.} - R\bar{v}_{i.})^2 \qquad (55)$$

$$= \sum_{i=1}^{N} \frac{M_i^2}{M_0^2 P_i} (\bar{y}_{i.} - R\bar{x}_{i.})^2 \qquad (56)$$

and instead of $S_{iz}^2$ we have $M_i^2 D_i^2 / M_0^2 P_i^2$, and (b) the cost 
function (38) regarded as a function of $m_i$ has the same 
relationship to $m_i$ as the one represented by (24) has with $m_i$, except 
that $c_1$ is now replaced by $c_1'$, where 

$$c_1' = c_1 + c_3 \sum_{i=1}^{N} P_i M_i \qquad (57)$$

It follows from (29) that the optimum value of $m_i$ is now 
determined by 

$$m_i = \sqrt{\frac{c_1'}{c_2 \Delta'}}\, \frac{M_i D_i}{M_0 P_i} \qquad (i = 1, 2, \ldots, N) \qquad (58)$$

where 

$$\Delta' = \sum_{i=1}^{N} P_i (\bar{z}_{i.} - R\bar{v}_{i.})^2 - \frac{1}{M_0^2} \sum_{i=1}^{N} \frac{M_i}{P_i} D_i^2 \qquad (59)$$

If $D_i$ is constant we reach the same result as (31), namely, 

$$\frac{P_i m_i}{M_i} = k$$
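To determine the optimum probabilities we form the function 
$\phi$ given by 

$$\phi = V(\bar{y}_R) + \mu (C - C_0) + \lambda \Big(\sum_{i=1}^{N} P_i - 1\Big) \qquad (60)$$

where $\mu$ and $\lambda$ are Lagrangian constants. Substituting for $V(\bar{y}_R)$ 
and C in terms of $P_i$, k and n, we obtain 

$$\phi = \frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - \frac{D_i^2}{M_i}\Big\} + \frac{1}{nkM_0^2} \sum_{i=1}^{N} M_i D_i^2$$

$$+ \mu \Big\{n \Big(c_1 + c_2 k M_0 + c_3 \sum_{i=1}^{N} P_i M_i\Big) - C_0\Big\} + \lambda \Big(\sum_{i=1}^{N} P_i - 1\Big) \qquad (61)$$

Differentiating $\phi$ with respect to $P_i$, k and n and equating to 
zero gives 

$$\frac{\partial \phi}{\partial P_i} = -\frac{M_i^2}{nM_0^2 P_i^2} \Big\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - \frac{D_i^2}{M_i}\Big\} + \mu n c_3 M_i + \lambda = 0 \qquad (i = 1, 2, \ldots, N) \qquad (62)$$

$$\frac{\partial \phi}{\partial k} = -\frac{1}{nk^2 M_0^2} \sum_{i=1}^{N} M_i D_i^2 + \mu n c_2 M_0 = 0 \qquad (63)$$

$$\frac{\partial \phi}{\partial n} = -\frac{1}{n^2 M_0^2} \Big[\sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - \frac{D_i^2}{M_i}\Big\} + \frac{1}{k} \sum_{i=1}^{N} M_i D_i^2\Big] + \mu \Big(c_1 + c_2 k M_0 + c_3 \sum_{i=1}^{N} P_i M_i\Big) = 0 \qquad (64)$$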
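The solutions of N + 2 equations (62), (63) and (64) subject to the 
two conditions imposed by the fixed cost and the sum of $P_i$'s 
being unity, give the optimum values of $P_i$, n and k. To solve 
these, we multiply (62) by $P_i$ and sum over all the N units, 
giving us 

$$\frac{1}{nM_0^2} \sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - \frac{D_i^2}{M_i}\Big\} = \mu n c_3 \sum_{i=1}^{N} P_i M_i + \lambda \qquad (65)$$

From (63), we have 

$$\frac{1}{nkM_0^2} \sum_{i=1}^{N} M_i D_i^2 = \mu n k c_2 M_0 \qquad (66)$$

while (64) gives 

$$\frac{1}{nM_0^2} \Big[\sum_{i=1}^{N} \frac{M_i^2}{P_i} \Big\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - \frac{D_i^2}{M_i}\Big\} + \frac{1}{k} \sum_{i=1}^{N} M_i D_i^2\Big] = \mu \Big(c_1 n + c_2 n k M_0 + c_3 n \sum_{i=1}^{N} P_i M_i\Big) \qquad (67)$$

Subtracting (65) and (66) from (67), we get 

$$\lambda = \mu c_1 n$$

whence substituting for $\lambda$ in (62), we get 

$$P_i \propto M_i \sqrt{\frac{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - D_i^2 / M_i}{c_1 + c_3 M_i}} \qquad (68)$$

or 

$$P_i = \frac{M_i \sqrt{\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - D_i^2 / M_i\} / (c_1 + c_3 M_i)}}{\sum_{i=1}^{N} M_i \sqrt{\{(\bar{y}_{i.} - R\bar{x}_{i.})^2 - D_i^2 / M_i\} / (c_1 + c_3 M_i)}} \qquad (69)$$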
It will be noticed that the optimum value of $P_i$ now depends 
upon (i) the size of the first-stage unit, and (ii) a quantity 
$\delta_i$ defined by 

$$\delta_i^2 = (\bar{y}_{i.} - R\bar{x}_{i.})^2 - \frac{D_i^2}{M_i} \qquad (70)$$

Hansen and Hurwitz (1943) have presented evidence which shows 
that $\delta_i$ tends to decrease as $M_i$ increases although not as fast as 
$M_i$. Assuming $\delta_i$ to be constant, it will be seen that for $c_3 = 0$, 
$P_i$ varies as $M_i$, thus confirming the result reached in Section 
8.5. When $c_1 = 0$ and $c_3 > 0$, probability proportional to the 
square root of $M_i$ will appear to be the optimum. 


Solving the other equations the reader may verify that the 
values of k and n are given by 

$$k = \sqrt{\frac{\Big(c_1 + c_3 \sum_{i=1}^{N} P_i M_i\Big) \sum_{i=1}^{N} M_i D_i^2}{c_2 M_0 \sum_{i=1}^{N} M_i^2 \delta_i^2 / P_i}} \qquad (71)$$

and 

$$n = \frac{C_0}{c_1 + c_2 k M_0 + c_3 \sum_{i=1}^{N} P_i M_i} \qquad (72)$$

The optima will naturally vary with the cost function and the 
sampling system, and care is necessary to determine from pilot 
studies the nature of the cost functions before deciding on the 
optima to be adopted for the surveys. 


8.8 Relative Efficiency of the Two Sub-Sampling Designs 
We remarked in the introduction to this chapter that a 
sub-sampling design in which the selection probabilities are 
proportional to the size of the first-stage units, and a constant number 
of second-stage units is drawn from each selected first-stage unit 
may bring about a marked improvement in precision compared 
to the sub-sampling design involving the use of equal 
probabilities. In this section we shall compare the two designs, taking 
the simple arithmetic mean of the first-stage unit means as the 
estimate for the latter system. 


For the system of sampling with probability proportional to 
size and a fixed number of second-stage units selected from each 
first-stage unit, we have seen that $\bar{z}_s$ provides an unbiased estimate 
of the population mean, and its mean square error is given by 
(14). This may be rewritten as 


N 
1 Mie ен Өй 

MSE G) = у) ў (9 = P3 
іші 

М 

+ ж М; S? (73) 
| nmN M 
іші 


For convenience, we shall suppose that first-stage units of the 
same size are grouped together. Defining then p; as the intra- 
class correlation within first-stage units of the same size, we may 
write 


ра? = Е{(у,— Ӯ.) Ou — F.) |} 
= Е(0,- у. +H. — У.) Ow — Pe +i. — 7.) | 
E {0 —.) On — 7.) |) -КЕО,- 7.) |} 


Mi 


= М, m - 1) 2s Ou -») Qu — ў.) T 0, — 7.) 


јуекил 
М4 
= ии, —1 [i Ou — Ж. | = ја d 
POT 
| S? 5 " 
= — М, d беу (14) 


Substituting the result in (73), we obtain 


N N 
"EN T. 
M.S.E. (2) = v ar ^ san “5 (75) 


іші іші 
For a sub-sampling design involving the use of equal selection 
probabilities at both stages, the simple arithmetic mean estimate 
is known to be biased and the mean square error is derived directly 
from (84) and (85) of Chapter VII. Ignoring the finite multiplier 
at the first stage of sampling, this is given by 


MSE. 9) = 1, y Ío.-».» — 24) 


* as à s^ + (1-1) Gy -»» 


2 a 1 н 
ag pu 2 
nN "m Бақы; nmN үз 5; 

іші іш 


IN E 
*(-296.-»» (16) 
The difference between the two mean Square errors is, therefore, 
given by 
А ем 
Уа) лен с на ic MN 
M.S.E.(3,) — M.S.E. (2) re ) 5 (4 1) 


ст (1 > У On. = у. ЈУ 


Putting Sr exe (IR 9j) where à; has a meaning similar to рї 
but is not necessarily equal to Pis 


we may express the difference as 


| 
Now both $\rho_i$ and $\delta_i$ will usually decrease as $M_i$ increases, so 
that the covariance between $\rho_i$ and $M_i/\bar{M}$ and between $\delta_i$ and 
$M_i/\bar{M}$ will be negative. The first term with its negative sign will, 
therefore, be positive but the second term will be negative but of 
a smaller order owing to the presence of the factor 1/m, while 
the third term will always remain positive. We, therefore, conclude 
that $\bar{y}_s$ will ordinarily have a higher mean square error than 
$\bar{z}_s$, showing the superiority of $\bar{z}_s$ over the estimate $\bar{y}_s$. 


8.9* Sub-Sampling without Replacement 


In the sub-sampling procedure considered in the previous 
sections we have assumed that sub-samples from the same 
first-stage unit are drawn independently of each other. We shall now 
consider a procedure in which sub-sampling is carried out wholly 
without replacement, that is to say, that if any first-stage unit 
occurs $\gamma$ times in the sample, a sub-sample of $m\gamma$ units will be 
drawn therefrom without replacement. 


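Following the previous notation, let 

$$\bar{z}_{i.} = \frac{M_i}{M_0 P_i}\, \bar{y}_{i.}$$

and 

$\bar{z}_{i(m\gamma_i)}$ = the mean per second-stage unit of the sub-sample 
drawn from the i-th first-stage unit. 


Consider the estimate $\bar{z}_s'$, given by 

$$\bar{z}_s' = \frac{1}{n} \sum_{i=1}^{N} \gamma_i\, \bar{z}_{i(m\gamma_i)} \qquad (78)$$

where $\gamma_i$ is a random variable with possible values 0, 1, 2, ..., n, 
such that $\sum_{i=1}^{N} \gamma_i = n$, and the probability that $\gamma_i$ is equal to r 
is given by the (r + 1)-th term of the binomial 
$\{(1 - P_i) + P_i\}^n$, namely, 

$$P\{\gamma_i = r\} = \binom{n}{r} P_i^r (1 - P_i)^{n-r}$$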
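It is easily shown that $\bar{z}_s'$ is an unbiased estimate of the 
population mean $\bar{y}_{..}$. For, we have 

$$E(\bar{z}_s') = E\Big\{\frac{1}{n} \sum_{i=1}^{N} \gamma_i\, \bar{z}_{i(m\gamma_i)}\Big\} = \frac{1}{n} E\Big[\sum_{i=1}^{N} \gamma_i E(\bar{z}_{i(m\gamma_i)} \mid \gamma_i)\Big] = \frac{1}{n} \sum_{i=1}^{N} E(\gamma_i)\, \bar{z}_{i.} \qquad (79)$$

We already know that for a binomial distribution 

$$E(\gamma_i) = nP_i \qquad (80)$$

On substituting from (80) in (79), we get 

$$E(\bar{z}_s') = \frac{1}{n} \sum_{i=1}^{N} nP_i \bar{z}_{i.} = \sum_{i=1}^{N} P_i \bar{z}_{i.} \qquad (81)$$

$$= \bar{y}_{..} \qquad (82)$$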
To obtain the sampling variance of the estimate, we have 


V) =E | » T M -+ 


1 5 а 
= RE | h Cu 8. 8,2 ] 
1=1 
i N 
= 5 Е р ve &, my, 7 2,-52, — z 


іші 


N 
1 ы " EIL 
F m E | D vor (Zi, тү; — 2. TZ) 


izei'zi 
x (2, uy. — Zy, F Zy.— ә) (83) 
Taking the first term іп (83), we write 


E (Буби +20) 


E [È vn my, — 2) + G2.) 


+2 Er m Zi) @— 2.3] 


Е [2 y? {к ( (Z, ny, 2% у.) е Е((2,—2. УІ, x, 
+ 2E (в. ту Ži) Gi Za) l i ӘҢ 
Likewise, the second term in (83) gives 


E { 5 УУ (2, my 7” 2. + 2.— zJ (2; — = ua 
ізбі?ші i 


my, 


= ЕЁ "3 Уг с... — 2.) (2, my, Zy.) 
T Quy — 2,) (2, — 2. ) + (2,- Z.) Gr, тү,” zy.) 
+@—) (ви. —2, »] 


ау (ащ г)! (85) 
ізеі?ші 


since the expectation of the other three terms is clearly zero, 
Substituting from (84) and (85) in (83), we may write 


1 148 CT 
(9-42) Тао ар E | 


| 
ы 
ТГ 
ca 
Ms 
| by 
S 
Ыб 
to 
| 
GR. ===> 
= 
|ы 
EX 
Hu 
[EN у 


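From (80), we know that $E(\gamma_i) = nP_i$. To evaluate $E(\gamma_i^2)$, we 
write 

$$E(\gamma_i^2) = \sum_{r=0}^{n} r^2 \binom{n}{r} P_i^r (1 - P_i)^{n-r}$$

which is clearly the second moment of the binomial distribution, 
and is, therefore, 

$$= nP_i (1 - P_i) + n^2 P_i^2 \qquad (87)$$

To evaluate the third term in (86), we require the values of 
$E(\gamma_i^2)$ and $E(\gamma_i \gamma_j)$. We write 

$$E(\gamma_i \gamma_j) = E\{E(\gamma_i \gamma_j \mid \gamma_i)\} = E\{\gamma_i E(\gamma_j \mid \gamma_i)\} \qquad (88)$$

where $E(\gamma_j \mid \gamma_i)$ denotes the expected value of $\gamma_j$ given $\gamma_i$. Now 
the probability of drawing the j-th unit, given $\gamma_i$, from a sample of 
$(n - \gamma_i)$ is $P_j / (1 - P_i)$. Hence, by analogy with (80), 

$$E(\gamma_j \mid \gamma_i) = (n - \gamma_i) \frac{P_j}{1 - P_i} \qquad (89)$$

Substituting from (89) in (88), we obtain 

$$E(\gamma_i \gamma_j) = E\Big[\gamma_i (n - \gamma_i) \frac{P_j}{1 - P_i}\Big] = \frac{P_j}{1 - P_i} \{nE(\gamma_i) - E(\gamma_i^2)\}$$

$$= \frac{P_j}{1 - P_i} \{n^2 P_i - nP_i (1 - P_i) - n^2 P_i^2\}$$

$$= n(n - 1) P_i P_j \qquad (90)$$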
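
The three moments can be checked by enumerating every ordered sequence of n draws. The sketch below does so for a small hypothetical case; it is a brute-force verification, practicable only for small n and N, and the function name is illustrative.

```python
from itertools import product

def count_moments(probs, n, i=0, j=1):
    """Exact E(g_i), E(g_i^2) and E(g_i * g_j), where g_i is the number of
    times unit i occurs in n independent draws made with probabilities probs."""
    e1 = e2 = e12 = 0.0
    for draws in product(range(len(probs)), repeat=n):
        pr = 1.0
        for d in draws:                 # probability of this ordered sequence
            pr *= probs[d]
        gi, gj = draws.count(i), draws.count(j)
        e1 += pr * gi
        e2 += pr * gi * gi
        e12 += pr * gi * gj
    return e1, e2, e12

P, n = [0.2, 0.3, 0.5], 3               # hypothetical probabilities and sample size
print(count_moments(P, n))
# Values given by the formulae for the same case:
print(n * P[0], n * P[0] * (1 - P[0]) + n ** 2 * P[0] ** 2, n * (n - 1) * P[0] * P[1])
```

The two printed lines should agree apart from rounding error.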
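Using (87) and (90), the third term in (86) may now be written as 

$$E\Big[\sum_{i=1}^{N} \gamma_i (\bar{z}_{i.} - \bar{z}_{..})\Big]^2 = \sum_{i=1}^{N} E(\gamma_i^2)(\bar{z}_{i.} - \bar{z}_{..})^2 + \sum_{i \neq j = 1}^{N} E(\gamma_i \gamma_j)(\bar{z}_{i.} - \bar{z}_{..})(\bar{z}_{j.} - \bar{z}_{..})$$

$$= \sum_{i=1}^{N} \{nP_i (1 - P_i) + n^2 P_i^2\}(\bar{z}_{i.} - \bar{z}_{..})^2 + \sum_{i \neq j = 1}^{N} n(n - 1) P_i P_j (\bar{z}_{i.} - \bar{z}_{..})(\bar{z}_{j.} - \bar{z}_{..})$$

$$= n \sum_{i=1}^{N} P_i (\bar{z}_{i.} - \bar{z}_{..})^2 + n(n - 1) \Big[\sum_{i=1}^{N} P_i (\bar{z}_{i.} - \bar{z}_{..})\Big]^2$$

$$= n \sigma_{bz}^2 \qquad (91)$$

since $\sum_{i=1}^{N} P_i (\bar{z}_{i.} - \bar{z}_{..})$ is clearly zero. 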

Hence, substituting from (80), (87) and (91) in (86), we obtain 
on rearranging terms, 


Ven =o +h ‚(ж = у) 54 


= == ЕТЕ 
т ln5u5 өз 


which can also be written as 


2j 1 M; ,. 
Ve) = DH Oso 


N 
= 0—1 ) | М, 
nM, M, S? (94) 
If further $S_i^2 = S_w^2$, we have 
ғ оз | Se (1 EE . Sé 
Уб) = Тт (m M п NM e» 


It is interesting to compare (93) with (12) after putting $m_i = m$ 
in the latter. The comparison shows that the variance is 
reduced by 

$$\frac{n - 1}{nM_0} \sum_{i=1}^{N} \frac{M_i}{M_0} S_i^2$$

showing that the procedure of sub-sampling considered in this 
section is more efficient than the previous procedure. This is in 
accordance with expectation. The gain is however small, since 
the contribution to the variance from any modification at the 
sub-sampling stage must necessarily be of a second order 
(Sukhatme and Narain, 1952). 


8.10* Estimation of the Variance from the Sample when 
Sub-Sampling is carried out without Replacement 


Consider the mean square between the first-stage unit means 
in the sample defined by 


N 
(п —1)5„® = у, Ci ny, — 2.) 
=1 i 


N 
= X уй ру — 12,° (96) 
Taking expectations, we obtain 


N 
@—1)Е(ыь® = Е (2 vE Fam lor} — EG 


— nE (2,?) 


[Ese E 


— nE (2,?) 
where $n'$ denotes the number of different first-stage units included 
in the sample. Substituting from (92) for $E(\bar{z}_s'^2)$, we may write 


1 С 5,2 

п = ACE 

| Ten ij Шы: 
іші 


or, on dividing by (n - 1), we have 


N n s 
2 s, 5 
Әй E ae Sis 
Е (5,2) = e, ті Р, М, Zr Se 


5, 
us AE M. (97) 


5? =з бу? = тыл $ 8,2 
% п—1 { п-і ту, 

1 п 1 | m 

n (n — 1) а Cn gs 4) Sis 


1 5,2 р 
ШТ ЛЫ, (98) 
or 
Big =e = г у ies В М): 5.2 
А E n | Й 
tn (n — 1) 2. (5 Жа xj Siz 
= | у Р, Sic” (99) 
п ' Mi 
where 


If Si = 52, (97) is simplified, being given by 


P, Sub E) 


E(s,?) = o —S 2g, то п 
іші 
М Р ы 
+ Sea? 3. Pe 
%=1 S 
whence, we get 
Я үз E (п) =i • S Sia Spit т Bi 
pun dii АН mt М, п м, 
(100) 
where 
n my; 
Б жа 5. en 2; (zu — £i v ту)? 
= m nm — n 


and Му denotes the harmonic mean of Муз in the sample. 
Further, when $P_i = M_i/M_0$, we have 


қүрт Е(т) —1 1 1. 
нысы = ка [Sy йт та) qon 
Substituting from (99) in the expression (92) for the variance 
of $\bar{z}_s'$, we get 


Est. V (Z) = ^n * mn(n —i) ny c 


(102) 


When $P_i = M_i/M_0$, and $s_{iz}^2$ is identical with $s_i^2$ and can be 
replaced by a pooled estimate $s_w^2$, we get 


он даһа. 
Est. У (2) = 9^ _ 5° (Е (н) —1 v 


n nm | n—1 ^ => УМ (103) 


The sampling procedure described in this section and the 
preceding one is widely used in India for estimating the acreage 
under crops. The design is particularly suitable for the introduc- 
tion of the improved methods of estimating crop acreages in 
tracts which are cadastrally surveyed. Thus, in Orissa, in India, 
where this design was first used, a village is used as the first-stage 
unit of sampling and selected with probability proportional to the 
number of survey numbers (fields) in the village. Each selected 
village is further divided into sub-units of 8 consecutive survey 
numbers, 1-8, 9-16, 17-24, etc., the last sub-unit consisting of 
the remainder. From the sub-units so formed, 4 sub-units are 
selected, giving an equal chance to all sub-units in the village. 
If a village occurs more than once in the sample, say $\gamma$ times, 
a sub-sample of $4\gamma$ sub-units is selected from it. 
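
The selection procedure described above may be sketched as follows. This is an illustration only: the village names and numbers of survey numbers are hypothetical, and the cap applied when a village has fewer sub-units than four times its number of occurrences is an assumption of the sketch, not a rule stated in the text.

```python
import random
from collections import Counter

def form_subunits(num_fields, size=8):
    """Split survey numbers 1..num_fields into sub-units of `size`
    consecutive numbers; the last sub-unit holds the remainder."""
    return [(start, min(start + size - 1, num_fields))
            for start in range(1, num_fields + 1, size)]

def select_sample(villages, n, rng, per_draw=4):
    """Draw n villages with replacement, with probability proportional to the
    number of survey numbers, then pick 4 * (times drawn) sub-units from each
    selected village with equal chance and without replacement."""
    names = list(villages)
    draws = rng.choices(names, weights=[villages[v] for v in names], k=n)
    sample = {}
    for name, times in Counter(draws).items():
        subunits = form_subunits(villages[name])
        take = min(per_draw * times, len(subunits))   # assumed cap for small villages
        sample[name] = sorted(rng.sample(subunits, take))
    return sample

villages = {"A": 120, "B": 64, "C": 200}   # hypothetical villages and field counts
print(form_subunits(20))
print(select_sample(villages, 5, random.Random(1)))
```

Each sub-unit is shown as a pair giving its first and last survey number.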


The design derives its efficiency from two factors: 


(i) the high correlation between the number of survey numbers 
in a village and the crop acreage, and 


(ii) the convenience and economy in field work arising from 
the choice of natural units as the sampling units at 
each stage. 

Uncultivated land, such as that occupied by dwellings, lying 
barren, or used as grassland, is usually given a single survey 
number in India, while the cultivated land, which is divided into 
a large number of fields, is given at least as many survey numbers 
as the number of holders in the village. Compared to the design 
which makes use of artificial units like square grids marked with 
the help of latitude and longitude on the map, such as for example 
the one used in the Bihar and the Bengal surveys (Mahalanobis, 
1945 and 1948), and in which apart from administrative 
inconvenience in locating the unit on the ground, a large proportion 
of units falls into uncultivated tracts, this design is found to be 
not only convenient for field work, but also statistically efficient. 
For further reference the reader is referred to the report on the 
estimation of acreage under crops in Orissa (I.C.A.R., 1950). 


8.11* Stratification and the Gain Due to it 


In this section we shall give the formulae for the estimate of 
the population mean in stratified sampling and its variance, and 
then proceed to estimate from a stratified sample the change in 
variance due to stratification. 


Let $P_{ti}$ denote the selection probability assigned to the i-th 
first-stage unit within the t-th stratum, so that 

$$\sum_{i=1}^{N_t} P_{ti} = 1 \qquad (t = 1, 2, \ldots, k)$$

Let $n_t$ denote the number of first-stage units to be included in the 
sample from the t-th stratum, so that 

$$\sum_{t=1}^{k} n_t = n$$

and $m_{ti}$ the number of second-stage units to be selected from 
the i-th first-stage unit of the t-th stratum. Defining then 

$$z_{tij} = \frac{M_{ti}}{M_{t0} P_{ti}}\, y_{tij} \qquad (i = 1, 2, \ldots, n_t;\ j = 1, 2, \ldots, m_{ti}) \qquad (104)$$

it is easy to see that 

$$\bar{z}_{ts} = \frac{1}{n_t} \sum_{i}^{n_t} \bar{z}_{ti(m_{ti})} \qquad (105)$$

provides an unbiased estimate of the population mean of the 
t-th stratum, and 

$$\bar{z}_w = \sum_{t=1}^{k} \lambda_t \bar{z}_{ts} \qquad (106)$$

that of the mean for the whole population, where 

$$\lambda_t = \frac{M_{t0}}{M_0} \qquad (107)$$
The variance of Zw is given by 


Е Ny 
= 2 J Oinen 1 . ut = EN 2 
V (Zo) pa М | т; ay п 2 Pu та М) Š но 
іші 


Іші 


(108) 
where 
N 
Paan = Z Pu (ги. — 2. (109) 
апа 
| Mi 
Sey = ML (zi — Zu.) (110) 
M,—1 


j=1 


If the sample were chosen as an unstratified sample, then the 
population mean would be estimated by 


a 


+ 1 x 
2, = 2 2 E] (111) 


М,” Р, Y (112) 
with its variance given by 


N 


өлү — e 1 1 1 
Шы ы (> м) 8o (113) 


1=1 
where 


N 


O?rta) -E Р, (4, — 2.) (114) 
and 


à 1 “з 
Буер. = M-i > (21; — 2.) (115) 


The difference between (113) and (108) gives the change in 
variance due to stratification. To estimate it from a stratified 
sample we shall suppose that the 7th unit in the population 
corresponds to the i-th unit in the ¢-th stratum, so that 


Pu = В (116) 
t. - 
where 
Ne 
y = У Р, 


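and rewrite equation (113) in a form more suitable for estimating
the gain due to stratification. Substituting for $P_l$ from (116) in
(114), and noting that $\bar z_{l.} = \bar z_{ti.}\,(\lambda_t / P_{t.})$, we write

$$
\begin{aligned}
\sigma_{bz}^2 &= \sum_{l=1}^{N} P_l\, (\bar z_{l.} - \bar z_{..})^2 \\
&= \sum_{t=1}^{k} P_{t.} \sum_{i=1}^{N_t} P_{ti} \left(\frac{\lambda_t}{P_{t.}}\, \bar z_{ti.} - \bar z_{..}\right)^2 \\
&= \sum_{t=1}^{k} P_{t.} \sum_{i=1}^{N_t} P_{ti} \left\{\frac{\lambda_t}{P_{t.}} (\bar z_{ti.} - \bar z_{t..}) + \left(\frac{\lambda_t}{P_{t.}}\, \bar z_{t..} - \bar z_{..}\right)\right\}^2 \\
&= \sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \sigma_{tbz}^2 + \sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \bar z_{t..}^2 - \bar z_{..}^2
\end{aligned}
\tag{117}
$$

Also

$$P_l\, S_{lz}^2 = \frac{\lambda_t^2}{P_{t.}}\, P_{ti}\, S_{tiz}^2 \tag{118}$$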
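Hence substituting for $\sigma_{bz}^2$ from (117) and for $P_l S_{lz}^2$ from
(118) in (113), we get

$$V(\bar z_s) = \frac{1}{n} \sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \sigma_{tbz}^2 + \frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \bar z_{t..}^2 - \bar z_{..}^2\right\} + \frac{1}{n} \sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}} \sum_{i=1}^{N_t} P_{ti} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) S_{tiz}^2 \tag{119}$$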
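The change in variance due to stratification is thus given by the
difference of (119) and (108), namely,

$$
\begin{aligned}
V_{us} - V_s &= \sum_{t=1}^{k} \lambda_t^2 \left(\frac{1}{n P_{t.}} - \frac{1}{n_t}\right) \sigma_{tbz}^2 \\
&\quad + \frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \bar z_{t..}^2 - \bar z_{..}^2\right\} \\
&\quad + \sum_{t=1}^{k} \lambda_t^2 \left(\frac{1}{n P_{t.}} - \frac{1}{n_t}\right) \sum_{i=1}^{N_t} P_{ti} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) S_{tiz}^2
\end{aligned}
\tag{120}
$$

and it is this which we are required to estimate from a given
stratified sample.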


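We write

$$
\begin{aligned}
\text{Est.}\,(V_{us} - V_s) &= \sum_{t=1}^{k} \lambda_t^2 \left(\frac{1}{n P_{t.}} - \frac{1}{n_t}\right) \text{Est.}\,(\sigma_{tbz}^2) \\
&\quad + \frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \text{Est.}\,(\bar z_{t..}^2) - \text{Est.}\,(\bar z_{..}^2)\right\} \\
&\quad + \sum_{t=1}^{k} \lambda_t^2 \left(\frac{1}{n P_{t.}} - \frac{1}{n_t}\right) \text{Est.} \left\{\sum_{i=1}^{N_t} P_{ti} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) S_{tiz}^2\right\}
\end{aligned}
\tag{121}
$$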
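Now, from (18), we have

$$\text{Est.}\,(\sigma_{tbz}^2) = s_{tbz}^2 - \frac{1}{n_t} \sum_{i=1}^{n_t} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) s_{tiz}^2 \tag{122}$$

where

$$s_{tbz}^2 = \frac{1}{n_t - 1} \sum_{i=1}^{n_t} (\bar z_{ti(m_{ti})} - \bar z_{ts})^2, \qquad s_{tiz}^2 = \frac{1}{m_{ti} - 1} \sum_{j=1}^{m_{ti}} (z_{tij} - \bar z_{ti(m_{ti})})^2$$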


Also

$$\text{Est.}\,(S_{tiz}^2) = s_{tiz}^2 \tag{123}$$


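It only remains to evaluate the middle terms in (121). We have

$$V(\bar z_{ts}) = E(\bar z_{ts}^2) - \bar z_{t..}^2 = \frac{\sigma_{tbz}^2}{n_t} + \frac{1}{n_t} \sum_{i=1}^{N_t} P_{ti} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) S_{tiz}^2$$

so that

$$\text{Est.}\,(\bar z_{t..}^2) = \bar z_{ts}^2 - \frac{1}{n_t}\, \text{Est.}\,(\sigma_{tbz}^2) - \frac{1}{n_t^2} \sum_{i=1}^{n_t} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) s_{tiz}^2 \tag{124}$$

Similarly

$$\bar z_{..}^2 = E(\bar z_w^2) - V(\bar z_w)$$

whence

$$\text{Est.}\,(\bar z_{..}^2) = \bar z_w^2 - \sum_{t=1}^{k} \frac{\lambda_t^2}{n_t}\, \text{Est.}\,(\sigma_{tbz}^2) - \sum_{t=1}^{k} \frac{\lambda_t^2}{n_t^2} \sum_{i=1}^{n_t} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) s_{tiz}^2 \tag{125}$$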
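Hence, from (124) and (125), we obtain

$$
\begin{aligned}
\frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \text{Est.}\,(\bar z_{t..}^2) - \text{Est.}\,(\bar z_{..}^2)\right\}
&= \frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \bar z_{ts}^2 - \bar z_w^2\right\} \\
&\quad - \frac{1}{n} \sum_{t=1}^{k} \frac{\lambda_t^2}{n_t} \left(\frac{1}{P_{t.}} - 1\right) \text{Est.}\,(\sigma_{tbz}^2) \\
&\quad - \frac{1}{n} \sum_{t=1}^{k} \frac{\lambda_t^2}{n_t^2} \left(\frac{1}{P_{t.}} - 1\right) \sum_{i=1}^{n_t} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) s_{tiz}^2
\end{aligned}
\tag{126}
$$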


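Substituting from (122), (123) and (126) in (121), and collecting
terms, we get

$$
\begin{aligned}
\text{Est.}\,\{V_{us} - V_s\} &= \frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \bar z_{ts}^2 - \bar z_w^2\right\} \\
&\quad + \sum_{t=1}^{k} \lambda_t^2 \left[\frac{1}{n P_{t.}} - \frac{1}{n_t} - \frac{1}{n n_t P_{t.}} + \frac{1}{n n_t}\right] \text{Est.}\,(\sigma_{tbz}^2) \\
&\quad + \sum_{t=1}^{k} \frac{\lambda_t^2}{n_t} \left[\frac{1}{n P_{t.}} - \frac{1}{n_t} - \frac{1}{n n_t P_{t.}} + \frac{1}{n n_t}\right] \sum_{i=1}^{n_t} \left(\frac{1}{m_{ti}} - \frac{1}{M_{ti}}\right) s_{tiz}^2
\end{aligned}
\tag{127}
$$

On using (122), equation (127) is further simplified, being given by

$$\text{Est.}\,\{V_{us} - V_s\} = \frac{1}{n} \left\{\sum_{t=1}^{k} \frac{\lambda_t^2}{P_{t.}}\, \bar z_{ts}^2 - \bar z_w^2\right\} + \sum_{t=1}^{k} \lambda_t^2 \left[\frac{1}{n P_{t.}} - \frac{1}{n_t} - \frac{1}{n n_t P_{t.}} + \frac{1}{n n_t}\right] s_{tbz}^2 \tag{128}$$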


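If $P_l = M_l / M_0$, then

$$P_{ti} = \frac{P_l}{\sum_l P_l} = \frac{M_{ti}}{\sum_{i=1}^{N_t} M_{ti}} = \frac{M_{ti}}{M_{t0}}$$

It follows immediately that

$$z_{tij} = z_{lj} = y_{tij}$$

Also

$$P_{t.} = \sum_{l} P_l = \sum_{i=1}^{N_t} \frac{M_{ti}}{M_0} = \frac{M_{t0}}{M_0} = \lambda_t$$

Then (128) reduces to

$$\text{Est.}\,\{V_{us} - V_s\}_{(P_l = M_l/M_0)} = \frac{1}{n} \left\{\sum_{t=1}^{k} \lambda_t\, \bar z_{ts}^2 - \bar z_w^2\right\} + \sum_{t=1}^{k} \frac{\lambda_t}{n n_t} \left\{(n_t - 1) - \lambda_t (n - 1)\right\} s_{tbz}^2 \tag{129}$$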
Example 8.1 


A sample survey for estimating the acreage under paddy was 
carried out in Orissa State during 1950-51. Each district of the 
State was divided into a suitable number of strata by grouping 
together adjoining administrative divisions in the district. From 
each stratum a sample of villages was selected with probability 
proportional to the number of survey numbers in the village. 
Each selected village was divided into clusters of 8 consecutive 
survey numbers, 1-8, 9-16, 17-24, etc., the last cluster consisting 
of the remainder. From the clusters so formed a simple random 
sample of 4 clusters was drawn (without replacement). If, how- 
ever, a village occurred more than once in the sample, say y 
times, a sample of 4y clusters was selected from it. 


Table 8.1 shows the number of villages and the number of
clusters in the population, the number of villages in the sample
and the number $n_t'$ of distinct villages in the sample, the
estimated area under paddy per cluster and the values for $s_{tb}^2$,
$s_{tb}'^2$, $s_{tw}^2$ and $s_{tw}'^2$ for each stratum. Calculate the sampling
variance of the estimate of the area under paddy per cluster in
the district, assuming

(a) Sub-samples of 4 within each selected village were selected
independently;

(b) A single sample of size 4y was selected from each selected
village (the method actually adopted).

Assume $M_{ti} / M_{t0} P_{ti}$ to be 1.

TABLE 8.1
Area Survey on Paddy, Orissa, 1950-51

Values of Means, Mean Squares between Village Means ($s_{tb}^2$, $s_{tb}'^2$) and Mean Squares
Within Villages ($s_{tw}^2$, $s_{tw}'^2$) in Acres per Cluster Basis

Stratum   $N_t$   $M_{t0}$   $n_t$   $n_t'$   $\bar y_{ts}$   $s_{tb}^2$   $s_{tb}'^2$   $s_{tw}^2$   $s_{tw}'^2$
1          434     71670      19      19      1.291     0.5058     ...        1.0726     1.0726
2          405     44114      13      12      2.597     3.7764     ...        8.1920     7.9921
3          565     33107      23      23      2.078     4.1255     ...        9.0720     9.0720
4          851     93734      34      32      2.098     1.8717     1.8105     4.4411     4.4334
5          271     24631      14      12      2.552     4.9885     ...       12.3426    12.1021
6          471     51776      18      18      1.675     0.6760     ...        3.6231     3.6231
7          347     44028      15      14      2.100     1.6487     1.6482     3.3946     3.3214
Sums      3344    363060     136     130

Under assumption (a), an estimate of the variance of the mean per
cluster in the t-th stratum is given by equation (19) as

$$\text{Est.}\, V(\bar y_{ts}) = \frac{s_{tb}^2}{n_t}$$
The variance of the district estimate is, therefore, given by

$$\text{Est.}\, V(\bar y_w) = \sum_{t=1}^{7} \lambda_t^2\, \frac{s_{tb}^2}{n_t}$$

where

$$\lambda_t = \frac{M_{t0}}{M_0}$$

The weights $\lambda_t$ are computed in the first column of Table 8.2.
Substituting from the table, we have

$$
\begin{aligned}
\text{Est.}\, V(\bar y_w) &= (0.19741)^2\, \frac{0.5058}{19} + (0.12151)^2\, \frac{3.7764}{13} + (0.091189)^2\, \frac{4.1255}{23} + (0.25818)^2\, \frac{1.8717}{34} \\
&\quad + (0.067843)^2\, \frac{4.9885}{14} + (0.14261)^2\, \frac{0.6760}{18} + (0.12127)^2\, \frac{1.6487}{15} \\
&= 0.001037 + 0.004289 + 0.001492 + 0.003669 + 0.001640 + 0.000764 + 0.001616 \\
&= 0.01451
\end{aligned}
$$
TABLE 8.2
Computations for Estimating Gain in Precision due to Stratification

Stratum   $\lambda_t = M_{t0}/M_0$   $\lambda_t \bar y_{ts}$   $\lambda_t \bar y_{ts}^2$   $\lambda_t / (n n_t)$   $(n_t - 1) - \lambda_t (n - 1)$   (4) x (5) x $s_{tb}^2$
          (1)          (2)       (3)       (4)            (5)        (6)
1         0.19741      0.2549    0.3290    0.000076397    -8.6504    -0.0003343
2         0.12151      0.3156    0.8195    0.000068727    -4.4038    -0.0011430
3         0.091189     0.1895    0.3938    0.000029152    +9.6895    +0.0011653
4         0.25818      0.5417    1.1364    0.000055835    -1.8543    -0.0001938
5         0.067843     0.1731    0.4418    0.000035632    +3.8412    +0.0006828
6         0.14261      0.2389    0.4001    0.000058256    -2.2524    -0.0000887
7         0.12127      0.2547    0.5348    0.000059446    -2.3714    -0.0002324
Sums                   1.9684    4.0554                              -0.0001441
Under assumption (b), an estimate of the variance in a single
stratum is given by equation (103). We may write

$$\text{Est.}\, V'(\bar y_{ts}) = \frac{s_{tb}'^2}{n_t} + \left(\frac{1}{n_t} \cdot \frac{n_t - n_t'}{m (n_t - 1)} - \frac{1}{M_{t0}}\right) s_{tw}'^2$$

For example,

$$
\begin{aligned}
\text{Est.}\, V'(\bar y_{4s}) &= \frac{1.8105}{34} + \left(\frac{1}{34} \cdot \frac{2}{4 \times 33} - \frac{1}{93734}\right) (4.4334) \\
&= 0.053250 + (0.00044563 - 0.00001067)\,(4.4334) \\
&= 0.055178
\end{aligned}
$$

The variance for the district estimate is then estimated by

$$
\begin{aligned}
\text{Est.}\, V'(\bar y_w) &= \sum_{t=1}^{7} \lambda_t^2\, \text{Est.}\, V'(\bar y_{ts}) \\
&= (0.19741)^2 (0.026606) + (0.12151)^2 (0.30281) + (0.091189)^2 (0.17910) + (0.25818)^2 (0.055178) \\
&\quad + (0.067843)^2 (0.36971) + (0.14261)^2 (0.037486) + (0.12127)^2 (0.11376) \\
&= 0.01481
\end{aligned}
$$

We note that contrary to expectation this value is larger than the
first estimate, a fact which may be attributed to sampling errors
in the estimates of $\sigma_{tb}^2$, $\sigma_{tb}'^2$, $S_{tw}^2$ and $S_{tw}'^2$; but the difference
is negligible.

The difference between the variance of the mean of a stratified
sample on assumption (a), and that of the mean of an unstratified
sample is estimated by equation (129). The necessary computations
are made in Table 8.2. We have

$$
\begin{aligned}
\text{Est.}\,\{V_{us} - V_s\} &= \frac{1}{n} \left\{\sum_{t=1}^{k} \lambda_t\, \bar y_{ts}^2 - \bar y_w^2\right\} + \sum_{t=1}^{k} \frac{\lambda_t}{n n_t} \left\{(n_t - 1) - \lambda_t (n - 1)\right\} s_{tb}^2 \\
&= \frac{1}{136} \left\{4.0554 - (1.968)^2\right\} + (-0.0001441) \\
&= 0.001341 - 0.000144 \\
&= 0.001197
\end{aligned}
$$

Thus the relative increase in variance if the sample were not
stratified would be

$$\frac{0.001197}{0.01451} = 0.082$$

or 8.2%.
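The arithmetic of Example 8.1 under assumption (a) can be checked with a short script. The sketch below is only an illustration: the function names are hypothetical, the data are the stratum figures of Table 8.1, and the formulas are the stratified variance estimate $\sum \lambda_t^2 s_{tb}^2 / n_t$ and equation (129). Because it carries full precision instead of the rounded value 1.968 used above, the gain it returns differs from 0.001197 in the last digits.

```python
# Stratum figures taken from Table 8.1 (Orissa paddy survey, 1950-51).
M_t0 = [71670, 44114, 33107, 93734, 24631, 51776, 44028]   # clusters in each stratum
n_t = [19, 13, 23, 34, 14, 18, 15]                        # villages sampled per stratum
ybar_ts = [1.291, 2.597, 2.078, 2.098, 2.552, 1.675, 2.100]
s_tb2 = [0.5058, 3.7764, 4.1255, 1.8717, 4.9885, 0.6760, 1.6487]


def stratified_variance(M_t0, n_t, s_tb2):
    """Est. V(ybar_w) = sum of lambda_t^2 * s_tb^2 / n_t, with lambda_t = M_t0 / M_0."""
    M0 = sum(M_t0)
    return sum((M / M0) ** 2 * s / n for M, n, s in zip(M_t0, n_t, s_tb2))


def gain_due_to_stratification(M_t0, n_t, ybar_ts, s_tb2):
    """Estimate of V_us - V_s from equation (129)."""
    M0 = sum(M_t0)
    n = sum(n_t)
    lam = [M / M0 for M in M_t0]
    ybar_w = sum(l * y for l, y in zip(lam, ybar_ts))
    # First term: weighted spread of the stratum means about the overall mean.
    between = (sum(l * y * y for l, y in zip(lam, ybar_ts)) - ybar_w ** 2) / n
    # Second term: correction involving the between-village mean squares.
    correction = sum(
        l / (n * nt) * ((nt - 1) - l * (n - 1)) * s
        for l, nt, s in zip(lam, n_t, s_tb2)
    )
    return between + correction


v_s = stratified_variance(M_t0, n_t, s_tb2)
gain = gain_due_to_stratification(M_t0, n_t, ybar_ts, s_tb2)
print(round(v_s, 5), round(gain, 6), round(gain / v_s, 3))
```

The ratio `gain / v_s` is the estimated relative increase in variance that would result from not stratifying.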


8.12* Collapsed Strata 


Hansen and Hurwitz (1943) advocate stratification to a degree 
where only one first-stage unit is selected from each stratum. 
Stratification to this degree may secure an optimum distribution 
of the sample when the strata are about equal, but offers one 
disadvantage in that an unbiased estimate of the error variance 
cannot be made. To overcome this difficulty Hansen and Hurwitz 
pool the strata in pairs which resemble each other as closely as 
possible and calculate an upper bound to the error variance of 
the estimate. One such procedure of calculating an upper bound 
will be described in this section.

We shall suppose that the population consists of only two
strata and that from each stratum one first-stage unit is selected
with probability proportional to the number of second-stage units
in the first-stage unit. Let

$N_1$ and $N_2$ denote the total numbers of first-stage units in
the first and second strata, respectively;

$M_{1i}$ and $M_{2j}$ the numbers of second-stage units in the
i-th and the j-th first-stage units of the
two strata, respectively (i = 1, 2, ..., $N_1$; j = 1, 2, ..., $N_2$);

$M_{10}$ and $M_{20}$ the total numbers of second-stage units
in the first and the second strata,
respectively, so that

$$M_0 = M_{10} + M_{20}$$

$\bar M_1$ and $\bar M_2$ the average sizes of the first-stage units
in the first and the second strata,
respectively;

$P_{1i}$ and $P_{2j}$ the selection probabilities at the first
draw for the i-th first-stage unit in the
first stratum and the j-th first-stage
unit in the second stratum, respectively;

and

$m_{1i}$ and $m_{2j}$ the numbers of second-stage units to be
included in the sample from the
selected first-stage units in the first
and the second strata, respectively.

For convenience, however, we shall suppose that the first-stage
unit selected from the first stratum is the c-th and that from the
second the d-th unit.

Since $P_{1i} = M_{1i} / M_{10}$, clearly $\bar y_{1c(m_{1c})}$ will represent an unbiased
estimate of $\bar y_{1..}$, the population mean per second-stage unit of
the first stratum; and similarly, since $P_{2j} = M_{2j} / M_{20}$, $\bar y_{2d(m_{2d})}$ will
represent an unbiased estimate of $\bar y_{2..}$ for the second stratum.

An unbiased estimate of the population mean $\bar y_{..}$ for the two
strata together will be given by the weighted mean of $\bar y_{1c(m_{1c})}$ and
$\bar y_{2d(m_{2d})}$ and may be denoted by $\bar y_w$, given by

$$\bar y_w = \lambda\, \bar y_{1c(m_{1c})} + (1 - \lambda)\, \bar y_{2d(m_{2d})} \tag{130}$$

where

$$\lambda = \frac{M_{10}}{M_0} \tag{131}$$
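The variance of $\bar y_w$ will be

$$
\begin{aligned}
V(\bar y_w) &= \lambda^2\, V(\bar y_{1c(m_{1c})}) + (1 - \lambda)^2\, V(\bar y_{2d(m_{2d})}) \\
&= \lambda^2 \left\{\sigma_{1b}^2 + \sum_{i=1}^{N_1} P_{1i} \left(\frac{1}{m_{1i}} - \frac{1}{M_{1i}}\right) S_{1i}^2\right\} + (1 - \lambda)^2 \left\{\sigma_{2b}^2 + \sum_{j=1}^{N_2} P_{2j} \left(\frac{1}{m_{2j}} - \frac{1}{M_{2j}}\right) S_{2j}^2\right\}
\end{aligned}
\tag{132}
$$

where

$$\sigma_{1b}^2 = \sum_{i=1}^{N_1} P_{1i}\, (\bar y_{1i.} - \bar y_{1..})^2, \qquad \sigma_{2b}^2 = \sum_{j=1}^{N_2} P_{2j}\, (\bar y_{2j.} - \bar y_{2..})^2 \tag{133}$$

and

$$S_{1i}^2 = \frac{1}{M_{1i} - 1} \sum_{k=1}^{M_{1i}} (y_{1ik} - \bar y_{1i.})^2, \qquad S_{2j}^2 = \frac{1}{M_{2j} - 1} \sum_{k=1}^{M_{2j}} (y_{2jk} - \bar y_{2j.})^2 \tag{134}$$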

Now, if the sample of two first-stage units were selected as an
unstratified sample with probability proportional to the measure of
size of the units, then an appropriate estimate of the population
mean would be given by the simple arithmetic mean of cluster
means in the sample, namely $\bar y_s$; and its variance by

$$V(\bar y_s) = \frac{1}{2} \left\{\sigma_b^2 + \sum_{i=1}^{N} P_i \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2\right\} \tag{135}$$

where

$$\sigma_b^2 = \sum_{i=1}^{N} P_i\, (\bar y_{i.} - \bar y_{..})^2 \tag{136}$$

and

$$S_i^2 = \frac{1}{M_i - 1} \sum_{k=1}^{M_i} (y_{ik} - \bar y_{i.})^2 \tag{137}$$

It follows from the relation $P_i = M_i / M_0$ and (131) that

$$P_i = \lambda P_{1i} \qquad (i = 1, 2, \ldots, N_1) \tag{138}$$

for units in the first stratum, and

$$P_j = (1 - \lambda) P_{2j} \qquad (j = 1, 2, \ldots, N_2) \tag{139}$$

for units in the second stratum.


Now equation (120) of the preceding section shows that when
$P_{t.} = \lambda_t$ and $m_{ti} = m_i$, and the allocation of first-stage units to
the two strata is in proportion to $\lambda : (1 - \lambda)$, (135) provides an
upper bound to the actual variance of the estimate of a stratified
sample. For evaluating it from the selected sample, we require
the estimate of $\sigma_b^2$. We write


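$$
\begin{aligned}
\sigma_b^2 &= \sum_{i=1}^{N} P_i\, (\bar y_{i.} - \bar y_{..})^2 \\
&= \sum_{i=1}^{N_1} \lambda P_{1i}\, (\bar y_{1i.} - \bar y_{1..} + \bar y_{1..} - \bar y_{..})^2 + \sum_{j=1}^{N_2} (1 - \lambda) P_{2j}\, (\bar y_{2j.} - \bar y_{2..} + \bar y_{2..} - \bar y_{..})^2 \\
&= \sum_{i=1}^{N_1} \lambda P_{1i} \left\{(\bar y_{1i.} - \bar y_{1..})^2 + (\bar y_{1..} - \bar y_{..})^2\right\} + \sum_{j=1}^{N_2} (1 - \lambda) P_{2j} \left\{(\bar y_{2j.} - \bar y_{2..})^2 + (\bar y_{2..} - \bar y_{..})^2\right\} \\
&= \lambda \sigma_{1b}^2 + (1 - \lambda) \sigma_{2b}^2 + \lambda (\bar y_{1..} - \bar y_{..})^2 + (1 - \lambda) (\bar y_{2..} - \bar y_{..})^2 \\
&= \lambda \sigma_{1b}^2 + (1 - \lambda) \sigma_{2b}^2 + \lambda\, \bar y_{1..}^2 + (1 - \lambda)\, \bar y_{2..}^2 - \bar y_{..}^2
\end{aligned}
\tag{140}
$$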


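We know that

$$V(\bar y_{1c(m_{1c})}) = E(\bar y_{1c(m_{1c})}^2) - \bar y_{1..}^2 = \sigma_{1b}^2 + \sum_{i=1}^{N_1} P_{1i} \left(\frac{1}{m_{1i}} - \frac{1}{M_{1i}}\right) S_{1i}^2 \tag{141}$$

and

$$V(\bar y_{2d(m_{2d})}) = E(\bar y_{2d(m_{2d})}^2) - \bar y_{2..}^2 = \sigma_{2b}^2 + \sum_{j=1}^{N_2} P_{2j} \left(\frac{1}{m_{2j}} - \frac{1}{M_{2j}}\right) S_{2j}^2 \tag{142}$$

giving us

$$\text{Est.}\, \bar y_{1..}^2 = \bar y_{1c(m_{1c})}^2 - \hat\sigma_{1b}^2 - \left(\frac{1}{m_{1c}} - \frac{1}{M_{1c}}\right) s_{1c}^2 \tag{143}$$

and

$$\text{Est.}\, \bar y_{2..}^2 = \bar y_{2d(m_{2d})}^2 - \hat\sigma_{2b}^2 - \left(\frac{1}{m_{2d}} - \frac{1}{M_{2d}}\right) s_{2d}^2 \tag{144}$$

Also, from (132), we get

$$\text{Est.}\, \bar y_{..}^2 = \bar y_w^2 - \lambda^2 \left[\hat\sigma_{1b}^2 + \left(\frac{1}{m_{1c}} - \frac{1}{M_{1c}}\right) s_{1c}^2\right] - (1 - \lambda)^2 \left[\hat\sigma_{2b}^2 + \left(\frac{1}{m_{2d}} - \frac{1}{M_{2d}}\right) s_{2d}^2\right] \tag{145}$$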

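Substituting for $\bar y_{1..}^2$, $\bar y_{2..}^2$ and $\bar y_{..}^2$ from (143), (144) and (145)
in (140), we get on simplification

$$
\begin{aligned}
\text{Est.}\, \sigma_b^2 &= \lambda\, \bar y_{1c(m_{1c})}^2 + (1 - \lambda)\, \bar y_{2d(m_{2d})}^2 - \bar y_w^2 + \lambda^2 \hat\sigma_{1b}^2 + (1 - \lambda)^2 \hat\sigma_{2b}^2 \\
&\quad - \lambda (1 - \lambda) \left(\frac{1}{m_{1c}} - \frac{1}{M_{1c}}\right) s_{1c}^2 - \lambda (1 - \lambda) \left(\frac{1}{m_{2d}} - \frac{1}{M_{2d}}\right) s_{2d}^2
\end{aligned}
\tag{146}
$$

Estimates of $\sigma_{1b}^2$ and $\sigma_{2b}^2$ cannot be calculated from a sample
of one each. In a reasonably efficient scheme of stratification,
however, $\sigma_{1b}^2$ and $\sigma_{2b}^2$ can each be expected to be smaller than
$\sigma_b^2$. Replacing them by their upper bound, namely $\sigma_b^2$, we then
obtain for the estimate of $\sigma_b^2$ an upper bound given by

$$\text{Est.}\, \sigma_b^2 = \frac{1}{2 \lambda (1 - \lambda)} \left\{\lambda\, \bar y_{1c(m_{1c})}^2 + (1 - \lambda)\, \bar y_{2d(m_{2d})}^2 - \bar y_w^2\right\} - \frac{1}{2} \left(\frac{1}{m_{1c}} - \frac{1}{M_{1c}}\right) s_{1c}^2 - \frac{1}{2} \left(\frac{1}{m_{2d}} - \frac{1}{M_{2d}}\right) s_{2d}^2 \tag{147}$$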
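Now, an estimate of the variance of the sample mean, if the
sample were selected as an unstratified sample, is obtained from
(135), being given by

$$\text{Est.}\, V(\bar y_s) = \frac{1}{2} \left[\text{Est.}\, \sigma_b^2 + \text{Est.} \left\{\sum_{i=1}^{N} P_i \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2\right\}\right] \tag{148}$$

Substituting for $\sigma_b^2$ from (147) and replacing $S_{1i}^2$ and $S_{2j}^2$ by
$s_{1c}^2$ and $s_{2d}^2$ respectively, we get

$$
\begin{aligned}
\text{Est.}\, V(\bar y_s) = \frac{1}{2} \Bigg[ &\frac{1}{2 \lambda (1 - \lambda)} \left\{\lambda\, \bar y_{1c(m_{1c})}^2 + (1 - \lambda)\, \bar y_{2d(m_{2d})}^2 - \bar y_w^2\right\} \\
&+ \left(\lambda - \tfrac{1}{2}\right) \left(\frac{1}{m_{1c}} - \frac{1}{M_{1c}}\right) s_{1c}^2 - \left(\lambda - \tfrac{1}{2}\right) \left(\frac{1}{m_{2d}} - \frac{1}{M_{2d}}\right) s_{2d}^2 \Bigg]
\end{aligned}
\tag{149}
$$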

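For $\lambda = \tfrac{1}{2}$, we have the inequality

$$\sigma_b^2 \ge \tfrac{1}{2} (\sigma_{1b}^2 + \sigma_{2b}^2)$$

Consequently (147) will always provide an upper bound to the
estimate of $\sigma_b^2$ and hence (149) will provide an upper bound to
the actual error variance of the mean of a stratified sample.
Substituting $\lambda = \tfrac{1}{2}$ in (149), we get the simple expression

$$\text{Est.}\, V(\bar y_s) = \tfrac{1}{4} \left(\bar y_{1c(m_{1c})} - \bar y_{2d(m_{2d})}\right)^2 \tag{150}$$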


It should be emphasized that the expression (150) is only an
upper bound and does not necessarily provide a satisfactory
approximation to the error variance. In fact, in most surveys
where stratification is effective, it will be found to result in
a considerable over-estimate of the actual error.
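As an illustration of how the collapsed-strata bound is computed, the sketch below evaluates the upper-bound estimate for a pair of strata from the two selected cluster means. The function name and the numerical inputs are hypothetical; the formula follows the bound described above, in which the between-cluster part equals half the squared difference of the two cluster means and the within-cluster corrections drop out when $\lambda = 1/2$, leaving one quarter of the squared difference as in (150).

```python
def collapsed_strata_bound(y1, y2, lam=0.5, w1=0.0, w2=0.0):
    """Upper-bound estimate of the error variance for two collapsed strata.

    y1, y2 : sample means of the clusters selected from strata 1 and 2
    lam    : weight of the first stratum, M_10 / M_0
    w1, w2 : within-cluster terms (1/m - 1/M) * s^2 for the two selected
             clusters; they cancel when lam = 1/2
    """
    y_w = lam * y1 + (1 - lam) * y2
    # Between-cluster part: equals (y1 - y2)^2 / 2 for any lam strictly between 0 and 1.
    between = (lam * y1 ** 2 + (1 - lam) * y2 ** 2 - y_w ** 2) / (2 * lam * (1 - lam))
    return 0.5 * (between + (lam - 0.5) * w1 - (lam - 0.5) * w2)


# Hypothetical cluster means for two strata of equal size.
bound = collapsed_strata_bound(2.0, 3.0)
print(bound)  # one quarter of the squared difference between the two means
```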


8.13 Sub-Sampling with Varying Probabilities of Selection at 
Each Stage 


Lastly, we shall consider a sub-sampling system in which units
at each stage of sampling are selected with replacement, with
probabilities proportional to measures of their sizes. It is easily
shown that under this system each unit measure of size gets an 
equal chance of being included in the sample and in consequence, 
a simple arithmetic mean of the ratios of the observed value y 
to the measure of size x for the units in the sample provides 
an unbiased estimate of the population ratio of the total of y 
to the total of x. 


Let

$x_{ij}$ denote the measure of size of the j-th second-stage unit
within the i-th first-stage unit;

$X_i$ the measure of size of the i-th first-stage unit,
so that

$$X_i = \sum_{j=1}^{M_i} x_{ij} \quad \text{and} \quad X = \sum_{i=1}^{N} X_i$$

$Y_i$ the total of y for the i-th first-stage unit, so that

$$Y_i = \sum_{j=1}^{M_i} y_{ij} \quad \text{and} \quad Y = \sum_{i=1}^{N} Y_i$$

$r_{ij}$ the ratio of y to x for the j-th second-stage unit
within the i-th first-stage unit given by

$$r_{ij} = \frac{y_{ij}}{x_{ij}}$$

$R_i$ the ratio of the total of y to the total of x for
the i-th first-stage unit given by

$$R_i = \frac{Y_i}{X_i}$$

and

$R$ the ratio of the population total of y to the
population total of x given by

$$R = \frac{Y}{X}$$

It is easily seen that

$$E(r_{ij} \mid i) = \sum_{j=1}^{M_i} \frac{x_{ij}}{X_i}\, r_{ij} = R_i \tag{151}$$

and

$$E(R_i) = \sum_{i=1}^{N} \frac{X_i}{X}\, R_i = R \tag{152}$$

Consider now the simple arithmetic mean of the ratios $r_{ij}$ given by

$$\bar r_{nm} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} r_{ij} \tag{153}$$

Clearly then, using (151) and (152), we get

$$E(\bar r_{nm}) = \frac{1}{nm}\, E \left\{\sum_{i=1}^{n} \sum_{j=1}^{m} E(r_{ij} \mid i)\right\} = \frac{1}{n} \sum_{i=1}^{n} E(R_i) = R \tag{154}$$
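The unbiasedness stated in (154) rests on the fact that a single draw, made with probability $X_i / X$ at the first stage and $x_{ij} / X_i$ at the second, has expectation $R$. The sketch below checks this by exact enumeration with rational arithmetic. The function name and the small data set are hypothetical and serve only to illustrate the argument.

```python
from fractions import Fraction


def expected_ratio_single_draw(x, y):
    """Exact expectation of r_ij = y_ij / x_ij for one two-stage draw.

    x[i][j], y[i][j] are the size measure and the observed value of the j-th
    second-stage unit in the i-th first-stage unit. First-stage units are
    drawn with probability X_i / X, second-stage units with x_ij / X_i.
    """
    X_i = [sum(row) for row in x]
    X = sum(X_i)
    total = Fraction(0)
    for xi, yi, Xi in zip(x, y, X_i):
        p_first = Fraction(Xi, X)
        for xij, yij in zip(xi, yi):
            p_second = Fraction(xij, Xi)
            total += p_first * p_second * Fraction(yij, xij)
    return total


# Hypothetical population: two first-stage units with 3 and 2 second-stage units.
x = [[2, 3, 5], [4, 6]]
y = [[1, 4, 2], [3, 9]]
R = Fraction(sum(map(sum, y)), sum(map(sum, x)))
print(expected_ratio_single_draw(x, y), R)
```

Since every one of the nm draws has this same expectation, the mean of the nm ratios has expectation $R$ as well.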
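To obtain the sampling variance of $\bar r_{nm}$, we write

$$
\begin{aligned}
V(\bar r_{nm}) &= E(\bar r_{nm} - R)^2 \\
&= E(\bar r_{nm} - \bar R_n + \bar R_n - R)^2 \qquad \text{where } \bar R_n = \frac{1}{n} \sum_{i=1}^{n} R_i, \\
&= E \left\{(\bar r_{nm} - \bar R_n)^2 + (\bar R_n - R)^2 + 2 (\bar r_{nm} - \bar R_n)(\bar R_n - R)\right\} \\
&= E(\bar r_{nm} - \bar R_n)^2 + E(\bar R_n - R)^2
\end{aligned}
\tag{155}
$$

since for a fixed sample of n first-stage units $E(\bar r_{nm}) = \bar R_n$ and in
consequence the last term is zero.

Taking the first term in (155), and writing $\bar r_{i(m)}$ for the mean of
the m ratios observed in the i-th selected unit, we have

$$
\begin{aligned}
E(\bar r_{nm} - \bar R_n)^2 &= \frac{1}{n^2}\, E \left[\sum_{i=1}^{n} (\bar r_{i(m)} - R_i)^2 + \sum_{i \ne i'}^{n} (\bar r_{i(m)} - R_i)(\bar r_{i'(m)} - R_{i'})\right] \\
&= \frac{1}{n^2}\, E \left[\sum_{i=1}^{n} E \left\{(\bar r_{i(m)} - R_i)^2 \mid i\right\}\right]
\end{aligned}
\tag{156}
$$

the second term vanishing since sampling within the i-th and
i'-th units is carried out independently.

From (87) of Chapter VI, we may write

$$E \left\{(\bar r_{i(m)} - R_i)^2 \mid i\right\} = \frac{\sigma_i^2}{m} \tag{157}$$

where

$$\sigma_i^2 = \sum_{j=1}^{M_i} \frac{x_{ij}}{X_i}\, (r_{ij} - R_i)^2 \tag{158}$$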
sampling units with equal probability and using the simple
arithmetic mean to estimate the yield rate. 


8.14* Sampling without Replacement at Each Stage 


So far we have assumed that the first-stage units are selected 
with replacement. Clearly, the efficiency of a sampling procedure 
is reduced by including the same unit twice or oftener in the 
sample. One method of improving the efficiency is to group 
together into strata first-stage units of about the same size. 
However, as pointed out in Section 7.15, stratification by size of 
unit may not be always feasible and even where such stratification 
is attempted the units within strata may still show considerable 
variation in size. In this situation sampling without replacement 
within strata with probability proportional to the measure of the 
size of the units can be used with considerable gains in efficiency. 
In this section we shall extend the theory to the case when 
the first-stage units are selected without replacement with vary- 
ing probabilities of selection and from each selected unit a 


simple random sample of predetermined size is drawn without 
replacement. 


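Let

$$z_{ij} = \frac{n M_i}{M_0\, E(a_i)}\, y_{ij} \tag{169}$$

where $E(a_i)$, as defined in Section 2a.4, denotes the probability
of including the i-th first-stage unit in a sample of n. Consider
the estimate

$$\bar z_s = \frac{1}{n} \sum_{i=1}^{n} \bar z_{i(m_i)} \tag{170}$$

Clearly, $\bar z_s$ gives an unbiased estimate of $\bar y_{..}$. For,

$$
\begin{aligned}
E(\bar z_s) &= E \left[\frac{1}{n} \sum_{i=1}^{n} \bar z_{i(m_i)}\right] \\
&= E \left[\frac{1}{n} \sum_{i=1}^{n} E(\bar z_{i(m_i)} \mid i)\right] \\
&= \frac{1}{n} \sum_{i=1}^{N} E(a_i)\, \bar z_{i.}
\end{aligned}
\tag{171}
$$

Substituting for $\bar z_{i.}$ from (169), we have

$$E(\bar z_s) = \bar y_{..} \tag{172}$$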
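To obtain the variance, we write

$$
\begin{aligned}
V(\bar z_s) &= E(\bar z_s^2) - \bar y_{..}^2 \\
&= E \left[\frac{1}{n^2} \left\{\sum_{i=1}^{n} \bar z_{i(m_i)}^2 + \sum_{i \ne j}^{n} \bar z_{i(m_i)}\, \bar z_{j(m_j)}\right\}\right] - \bar y_{..}^2 \\
&= E \left[\sum_{i=1}^{n} \frac{M_i^2}{M_0^2\, E(a_i)^2}\, \bar y_{i(m_i)}^2 + \sum_{i \ne j}^{n} \frac{M_i M_j}{M_0^2\, E(a_i) E(a_j)}\, \bar y_{i(m_i)}\, \bar y_{j(m_j)}\right] - \bar y_{..}^2 \\
&= E \left[\sum_{i=1}^{n} \frac{M_i^2}{M_0^2\, E(a_i)^2} \left\{\bar y_{i.}^2 + \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2\right\} + \sum_{i \ne j}^{n} \frac{M_i M_j}{M_0^2\, E(a_i) E(a_j)}\, \bar y_{i.}\, \bar y_{j.}\right] - \bar y_{..}^2 \\
&= \sum_{i=1}^{N} \frac{M_i^2}{M_0^2\, E(a_i)} \left\{\bar y_{i.}^2 + \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2\right\} + \sum_{i \ne j}^{N} \frac{E(a_i a_j)}{E(a_i) E(a_j)}\, \frac{M_i M_j}{M_0^2}\, \bar y_{i.}\, \bar y_{j.} - \bar y_{..}^2
\end{aligned}
\tag{173}
$$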


where $E(a_i a_j)$, as defined in Section 2a.5, denotes the probability
of including the i-th and the j-th units in the sample.


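Expanding $\bar y_{..}^2$, we may rewrite (173) as

$$
\begin{aligned}
V(\bar z_s) &= \sum_{i=1}^{N} \frac{1 - E(a_i)}{E(a_i)}\, \frac{M_i^2}{M_0^2}\, \bar y_{i.}^2 \\
&\quad + \sum_{i \ne j}^{N} \frac{E(a_i a_j) - E(a_i) E(a_j)}{E(a_i) E(a_j)}\, \frac{M_i M_j}{M_0^2}\, \bar y_{i.}\, \bar y_{j.} \\
&\quad + \sum_{i=1}^{N} \frac{M_i^2}{M_0^2\, E(a_i)} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2
\end{aligned}
\tag{174}
$$

This expression was first given by Horvitz and Thompson
(1952).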


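For the case of simple random sampling, we know (see, e.g., (31)
of Chapter II) that

$$E(a_i) = \frac{n}{N}, \qquad E(a_i a_j) = \frac{n (n - 1)}{N (N - 1)}$$

Substituting in (170), we notice that the estimate $\bar z_s$ reduces to
the estimate of Section 7.11, namely,

$$\bar z_s = \frac{1}{n \bar M} \sum_{i=1}^{n} M_i\, \bar y_{i(m_i)} \tag{175}$$

and the expression for the variance becomes identical with that
given by (91) of Section 7.12, as expected. Thus,

$$
\begin{aligned}
V(\bar z_s) &= \left(\frac{1}{n} - \frac{1}{N}\right) \left\{\frac{1}{N} \sum_{i=1}^{N} \frac{M_i^2}{\bar M^2}\, \bar y_{i.}^2 - \frac{1}{N (N - 1)} \sum_{i \ne j}^{N} \frac{M_i M_j}{\bar M^2}\, \bar y_{i.}\, \bar y_{j.}\right\} + \frac{1}{nN} \sum_{i=1}^{N} \frac{M_i^2}{\bar M^2} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2 \\
&= \left(\frac{1}{n} - \frac{1}{N}\right) S_b'^2 + \frac{1}{nN} \sum_{i=1}^{N} \frac{M_i^2}{\bar M^2} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_i^2
\end{aligned}
\tag{176}
$$

where

$$S_b'^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(\frac{M_i}{\bar M}\, \bar y_{i.} - \bar y_{..}\right)^2 \tag{177}$$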
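The reduction of the general variance formula (174) to the simple random sampling form (176) can be verified numerically. The sketch below is a check under stated assumptions: the function names and the four-unit population are hypothetical, and the inclusion probabilities are those of simple random sampling without replacement, n/N for one unit and n(n-1)/(N(N-1)) for a pair.

```python
def variance_general(M, ybar, S2, m, pi, pij):
    """Variance (174) for given inclusion probabilities pi[i] and pij[i][j]."""
    N = len(M)
    M0 = sum(M)
    u = [M[i] * ybar[i] / M0 for i in range(N)]
    v = sum((1 - pi[i]) / pi[i] * u[i] ** 2 for i in range(N))
    v += sum(
        (pij[i][j] - pi[i] * pi[j]) / (pi[i] * pi[j]) * u[i] * u[j]
        for i in range(N) for j in range(N) if i != j
    )
    v += sum(
        M[i] ** 2 / (M0 ** 2 * pi[i]) * (1 / m[i] - 1 / M[i]) * S2[i]
        for i in range(N)
    )
    return v


def variance_srs(M, ybar, S2, m, n):
    """Variance (176) for simple random sampling of n first-stage units."""
    N = len(M)
    Mbar = sum(M) / N
    u = [M[i] * ybar[i] / Mbar for i in range(N)]
    mean_u = sum(u) / N                      # equals the population mean per element
    Sb2 = sum((ui - mean_u) ** 2 for ui in u) / (N - 1)
    within = sum(
        (M[i] / Mbar) ** 2 * (1 / m[i] - 1 / M[i]) * S2[i] for i in range(N)
    ) / (n * N)
    return (1 / n - 1 / N) * Sb2 + within


# Hypothetical population of N = 4 first-stage units, sample of n = 2.
M = [10, 20, 30, 40]
ybar = [2.0, 3.5, 1.0, 4.0]
S2 = [1.2, 0.8, 2.0, 1.5]
m = [4, 4, 4, 4]
N, n = 4, 2
pi = [n / N] * N
pij = [[n * (n - 1) / (N * (N - 1))] * N for _ in range(N)]

v_general = variance_general(M, ybar, S2, m, pi, pij)
v_srs = variance_srs(M, ybar, S2, m, n)
print(v_general, v_srs)
```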


In developing these formulae, we assumed that $m_i$, the number
to be sampled from the i-th first-stage unit, is known in advance.
In order to determine it we use the principle of minimizing the
cost of the survey for a given precision, or alternatively, of
maximizing the precision for a given cost. The application of
this principle when the cost is represented by (20) is straightforward
and follows step by step the analysis shown in Section 8.4. It
will be found that the optimum value of $m_i$ is approximately
given by

$$\frac{m_i\, E(a_i)}{M_i} = \text{constant} \tag{178}$$

When $E(a_i)$ is proportional to $M_i$, as will usually be our attempt
to make it, $m_i$ will be constant irrespective of the first-stage unit
included in the sample. One important simplification arising from
this design is that the estimate $\bar z_s$ reduces to the simple arithmetic
mean of the nm y-values in the sample.
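An unbiased estimate of $V(\bar z_s)$ is easily computed. We write,
from (174),

$$
\begin{aligned}
\text{Est.}\, V(\bar z_s) &= \sum_{i=1}^{n} \frac{1 - E(a_i)}{E(a_i)^2}\, \frac{M_i^2}{M_0^2}\, \text{Est.}\,(\bar y_{i.}^2 \mid i) \\
&\quad + \sum_{i \ne j}^{n} \frac{E(a_i a_j) - E(a_i) E(a_j)}{E(a_i a_j)\, E(a_i) E(a_j)}\, \frac{M_i M_j}{M_0^2}\, \text{Est.}\,(\bar y_{i.}\, \bar y_{j.} \mid i, j) \\
&\quad + \sum_{i=1}^{n} \frac{1}{E(a_i)^2}\, \frac{M_i^2}{M_0^2} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) \text{Est.}\,(S_i^2 \mid i) \\
&= \sum_{i=1}^{n} \frac{1 - E(a_i)}{E(a_i)^2}\, \frac{M_i^2}{M_0^2} \left\{\bar y_{i(m_i)}^2 - \left(\frac{1}{m_i} - \frac{1}{M_i}\right) s_i^2\right\} \\
&\quad + \sum_{i \ne j}^{n} \frac{E(a_i a_j) - E(a_i) E(a_j)}{E(a_i a_j)\, E(a_i) E(a_j)}\, \frac{M_i M_j}{M_0^2}\, \bar y_{i(m_i)}\, \bar y_{j(m_j)} \\
&\quad + \sum_{i=1}^{n} \frac{1}{E(a_i)^2}\, \frac{M_i^2}{M_0^2} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) s_i^2 \\
&= \sum_{i=1}^{n} \frac{1 - E(a_i)}{E(a_i)^2}\, \frac{M_i^2}{M_0^2}\, \bar y_{i(m_i)}^2 \\
&\quad + \sum_{i \ne j}^{n} \frac{E(a_i a_j) - E(a_i) E(a_j)}{E(a_i a_j)\, E(a_i) E(a_j)}\, \frac{M_i M_j}{M_0^2}\, \bar y_{i(m_i)}\, \bar y_{j(m_j)} \\
&\quad + \sum_{i=1}^{n} \frac{1}{E(a_i)}\, \frac{M_i^2}{M_0^2} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) s_i^2
\end{aligned}
\tag{179}
$$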
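An alternative estimate of the variance based on a linear
combination of the squared differences in the sample, which
appears to be better than the one given above, can be formed
following the device mentioned in the foot-note on page 71 of
Chapter II. Thus, we rewrite (174) as

$$V(\bar z_s) = \frac{1}{2 n^2} \sum_{i \ne j}^{N} \left\{E(a_i) E(a_j) - E(a_i a_j)\right\} (\bar z_{i.} - \bar z_{j.})^2 + \frac{1}{n^2} \sum_{i=1}^{N} E(a_i) \left(\frac{1}{m_i} - \frac{1}{M_i}\right) S_{iz}^2$$

whence

$$
\begin{aligned}
\text{Est.}\, V(\bar z_s) &= \frac{1}{2 n^2} \sum_{i \ne j}^{n} \frac{E(a_i) E(a_j) - E(a_i a_j)}{E(a_i a_j)}\, \text{Est.} \left\{(\bar z_{i.} - \bar z_{j.})^2 \mid i, j\right\} + \frac{1}{n^2} \sum_{i=1}^{n} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) s_{iz}^2 \\
&= \frac{1}{2 n^2} \sum_{i \ne j}^{n} \frac{E(a_i) E(a_j) - E(a_i a_j)}{E(a_i a_j)} \left\{(\bar z_{i(m_i)} - \bar z_{j(m_j)})^2 - \left(\frac{1}{m_i} - \frac{1}{M_i}\right) s_{iz}^2 - \left(\frac{1}{m_j} - \frac{1}{M_j}\right) s_{jz}^2\right\} \\
&\quad + \frac{1}{n^2} \sum_{i=1}^{n} \left(\frac{1}{m_i} - \frac{1}{M_i}\right) s_{iz}^2
\end{aligned}
\tag{180}
$$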


The theory of sampling without replacement developed in this
section is difficult to use in practice on account of the heavy
computations involved in evaluating $E(a_i)$ and $E(a_i a_j)$. For the
important case of samples of two within each stratum, explicit
expressions for $E(a_i)$ and $E(a_i a_j)$ in terms of the selection
probabilities $P_1, P_2, \ldots, P_N$ have been given in Section 2b.4 and
are relatively easy to compute. For larger samples the use of the
estimate appropriate for sampling with replacement, introducing
the usual finite multiplier for calculating the error variance, is
probably sufficiently satisfactory (Yates, 1949).
CHAPTER IX
SYSTEMATIC SAMPLING 


9.1 Introduction 


So far we have considered methods of sampling in which the
successive units (whether elements or clusters) were selected with
the help of random numbers. We shall now consider a method
of sampling in which only the first unit is selected with the help
of random numbers, the rest being selected automatically according
to a predetermined pattern. The method is known as systematic
sampling.

The pattern usually followed in selecting a systematic sample
is a simple pattern involving regular spacing of units. Thus,
suppose a population consists of N units, serially numbered from
1 to N. Suppose further that N is expressible as a product of
two integers k and n, so that N = kn. Draw a random number
less than or equal to k, say i, and select the unit with the corresponding
serial number and every k-th unit in the population thereafter.
Clearly, the sample will contain the n units i, i + k, i + 2k, ...,
i + (n - 1)k, and is known as a systematic sample. The selection
of every k-th strip in forest sampling for the estimation of
timber, the selection of a corn field, every k-th mile apart, for
observation on incidence of borers, the selection of every k-th
time-interval for observing the number of fishing craft landing on
the coast, the selection of every k-th punched card for advance
tabulation or of every k-th village from a list of villages, after
the first unit is chosen with the help of random numbers not
exceeding k, are all examples of systematic sampling. In the first
three examples, the sequence of numbering is determined by
Nature, the first two providing examples of distribution
in space while the third that of distribution in time. In the
fourth and the fifth, the ordering may be either alphabetical or
arbitrary approximating to a random distribution. In the latter
case, a systematic sample will obviously be equivalent to a random
sample. The method is extensively used in practice on account
of its low cost and simplicity in the selection of the sample.
The latter consideration is particularly important in situations
where the selection of a sample is carried out by the field staff
themselves. A systematic sample also offers great advantages in
organizing control over field work. 


In a systematic sample, as noted already, the relative position
in the population of the different units included in the sample
is fixed. There is consequently no risk in the method that any
large contiguous part of the population will fail to be represented.
Indeed, the method will give an evenly spaced sample and is,
therefore, likely to give a more precise estimate of the population
mean than a random sample unless the k-th units constituting
the sample happen to be alike or correlated. The method
resembles stratified sampling in that one sampling unit is selected
from each stratum of k consecutive units. In reality, however,
the resemblance is only casual. In stratified sampling the unit
to be selected from each stratum is randomly drawn; in
systematic sampling its position relative to the unit in the first
stratum is fixed in advance. Unless, therefore, the units in each
stratum are listed in a random order, a systematic sample will not be
equivalent to a stratified random sample.

Systematic sampling strictly resembles cluster sampling, a
systematic sample being equivalent to a sample of one cluster
selected out of the k clusters of n units each, shown in the
schematic diagram below in the form of k columns of n each:

Schematic Diagram Showing the Serial Number of the Unit
in the Population in Terms of the Cluster Number
and its Serial Position in the Cluster

Serial position                      Cluster
in the cluster   1            2            ...   i            ...   k
1                1            2            ...   i            ...   k
2                1+k          2+k          ...   i+k          ...   2k
3                1+2k         2+2k         ...   i+2k         ...   3k
...              ...          ...          ...   ...          ...   ...
j                1+(j-1)k     2+(j-1)k     ...   i+(j-1)k     ...   jk
...              ...          ...          ...   ...          ...   ...
n                1+(n-1)k     2+(n-1)k     ...   i+(n-1)k     ...   nk

(1)
Since the first number less than or equal to k is to be chosen
at random, every one of the k columns gets an equal chance of
being chosen as the systematic sample. It follows that the theory
of systematic sampling can be deduced from the theory of cluster
sampling dealt with in Chapter VI.

In presenting the theory in this chapter, we shall assume that
N = nk, where n is the size of the sample and k is an integer.
In practice N may not be so expressible, and the results presented
in this chapter may not be strictly applicable. However, the
disturbance is not likely to be large unless n is small.
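The selection procedure just described, and its equivalence to choosing one of the k columns of the schematic diagram, can be sketched in a few lines. The function names below are hypothetical, and the sketch assumes, as the text does, that N is an exact multiple of k.

```python
import random


def systematic_sample(N, k, rng=random):
    """Serial numbers of a systematic sample: random start i in 1..k, then every k-th unit."""
    if N % k != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    n = N // k
    i = rng.randint(1, k)          # random start, each value 1..k equally likely
    return [i + j * k for j in range(n)]


def all_systematic_samples(N, k):
    """The k possible samples, i.e. the k columns (clusters) of the schematic diagram."""
    n = N // k
    return [[i + j * k for j in range(n)] for i in range(1, k + 1)]


clusters = all_systematic_samples(12, 4)
print(clusters)                    # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
print(systematic_sample(12, 4))    # one of the four columns above
```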

9.2 The Sample Mean and its Variance 


Let

$y_{ij}$ denote the observation on the unit bearing the serial
number i + (j - 1)k in the population (i = 1,
2, ..., k; j = 1, 2, ..., n);

$\bar y_{i.}$ the sample mean

$$= \frac{1}{n} \sum_{j=1}^{n} y_{ij} \tag{2}$$

and

$\bar y_{..}$ the population mean

$$= \frac{1}{nk} \sum_{i=1}^{k} \sum_{j=1}^{n} y_{ij} \tag{3}$$

Since the probability of selecting the i-th column as the
systematic sample is 1/k, it follows that

$$E(\bar y_{i.}) = \frac{1}{k} \sum_{i=1}^{k} \bar y_{i.} = \bar y_{..} \tag{4}$$

showing that a systematic sample provides an unbiased estimate
of the population mean.
The variance is given by

$$V(\bar y_{i.}) = E \left\{(\bar y_{i.} - \bar y_{..})^2\right\} = \frac{1}{k} \sum_{i=1}^{k} (\bar y_{i.} - \bar y_{..})^2 \tag{5}$$

which can also be written as

$$V(\bar y_{i.}) = \frac{k - 1}{k}\, S_c^2 \tag{6}$$

where $S_c^2$ denotes the mean square between column means, the
letter c standing for a column.


9.3 Comparison of Systematic with Random Sampling 


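The variance of the mean of a random sample of n
units chosen from a population of size N is known to be
given by

$$V(\bar y_n)_R = \left(\frac{1}{n} - \frac{1}{N}\right) S^2 \tag{7}$$

where $S^2$ is the mean square between units in the population.
This is not directly comparable with (6), and it is therefore
important to express the variance of a systematic sample in an
alternative form suitable for this comparison. We write

$$
\begin{aligned}
(k - 1)\, S_c^2 &= \sum_{i=1}^{k} (\bar y_{i.} - \bar y_{..})^2 \\
&= \frac{1}{n^2} \sum_{i=1}^{k} \left\{\sum_{j=1}^{n} (y_{ij} - \bar y_{..})\right\}^2 \\
&= \frac{1}{n^2} \left[\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar y_{..})^2 + \sum_{i=1}^{k} \sum_{j \ne j'}^{n} (y_{ij} - \bar y_{..})(y_{ij'} - \bar y_{..})\right] \\
&= \frac{1}{n^2} \left[(kn - 1)\, S^2 + \sum_{i=1}^{k} \sum_{j \ne j'}^{n} (y_{ij} - \bar y_{..})(y_{ij'} - \bar y_{..})\right]
\end{aligned}
\tag{8}
$$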


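Now, by definition, the intra-class correlation between units of
a column is given by

$$\rho = \frac{E\,(y_{ij} - \bar y_{..})(y_{ij'} - \bar y_{..})}{E\,(y_{ij} - \bar y_{..})^2} = \frac{\sum_{i=1}^{k} \sum_{j \ne j'}^{n} (y_{ij} - \bar y_{..})(y_{ij'} - \bar y_{..})}{kn (n - 1)} \cdot \frac{kn}{(kn - 1)\, S^2} \tag{9}$$

or

$$\sum_{i=1}^{k} \sum_{j \ne j'}^{n} (y_{ij} - \bar y_{..})(y_{ij'} - \bar y_{..}) = (n - 1)(kn - 1)\, \rho\, S^2 \tag{10}$$

Substituting from (10) in (8), we get

$$V(\bar y_{i.})_{sy} = \frac{kn - 1}{kn} \cdot \frac{S^2}{n} \left\{1 + \rho\, (n - 1)\right\} \tag{11}$$

which is a convenient form for purposes of comparison with the
variance of the mean of a random sample. The variance of
a systematic sample relative to that of a random sample is seen
to be

$$\frac{V_{sy}}{V_R} = \frac{(nk - 1) \left\{1 + \rho\, (n - 1)\right\}}{n (k - 1)} \tag{12}$$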

We notice that the relative precision depends on the value of
$\rho$. For $\rho = -1/(kn - 1)$, the two methods give estimates of
equal precision; for $\rho$ greater than $-1/(kn - 1)$, systematic
sampling is less accurate than random sampling; while for $\rho$ less
than $-1/(kn - 1)$, systematic sampling is superior to random
sampling. The minimum value which $\rho$ can take is $-1/(n - 1)$,
when the variance of a systematic sample will be zero, and the
reduction in variance over random sampling will therefore be
100%. The maximum value which $\rho$ can assume is 1, when
the efficiency of systematic relative to random sampling will be
given by (k - 1)/(nk - 1).
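For a population small enough to enumerate, the quantities entering this comparison can be computed directly. The sketch below is an illustration only: the function name and the twelve-unit population are hypothetical. It evaluates the variance of the systematic sample from the column means, the variance of a random sample of the same size, and the intra-class correlation, and then confirms that the expression in terms of the intra-class correlation reproduces the direct value.

```python
def systematic_summary(y, k):
    """Return (V_sy, V_R, rho, V_sy via rho) for a population y listed in serial order."""
    N = len(y)
    n = N // k
    cols = [y[i::k] for i in range(k)]      # column i holds units i+1, i+1+k, ...
    mean = sum(y) / N
    S2 = sum((v - mean) ** 2 for v in y) / (N - 1)
    col_means = [sum(c) / n for c in cols]
    v_sy = sum((cm - mean) ** 2 for cm in col_means) / k     # direct variance of column means
    v_r = (1 / n - 1 / N) * S2                               # random sample of n units
    # Sum of products of deviations for distinct units of the same column.
    cross = sum(
        (c[a] - mean) * (c[b] - mean)
        for c in cols for a in range(n) for b in range(n) if a != b
    )
    rho = cross / ((n - 1) * (N - 1) * S2)                   # intra-class correlation
    v_sy_from_rho = (N - 1) / N * S2 / n * (1 + rho * (n - 1))
    return v_sy, v_r, rho, v_sy_from_rho


# Hypothetical population of N = 12 units, k = 4 (so n = 3).
y = [3, 7, 1, 8, 5, 9, 2, 6, 4, 10, 12, 11]
v_sy, v_r, rho, v_check = systematic_summary(y, 4)
print(v_sy, v_r, rho)
```

Systematic sampling is the more precise method for this listing exactly when `rho` falls below -1/(N - 1).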


In general, however, it is difficult to know what values $\rho$ will
take in populations distributed in space or time, and no general
conclusions can therefore be drawn about the relative efficiency
of systematic and random sampling. On the other hand, for
populations for which the lists of units are prepared in alphabetical
or arbitrary order and where there is little likelihood of
the lists corresponding to any physical distribution, we may
assume the intra-class correlation to provide a good estimate of its
average value in randomly formed columns, namely $-1/(nk - 1)$,
and hence expect systematic and random sampling to give results
of about equal precision on an average.

It is instructive to express the variance of the systematic sample
in terms of a further break-up of the intra-class correlation
coefficient. It will be noticed from (10) that $\rho$ is expressed as
the sum of kn(n - 1) products of y deviations: 2k(n - 1) of
these products relate to y deviations separated by one row,
2k(n - 2) of these products relate to y deviations separated by
two rows, etc. We may, therefore, write (10) as

$$
\begin{aligned}
(n - 1)(kn - 1)\, \rho\, S^2 &= 2 \sum_{i=1}^{k} \sum_{a=1}^{n-1} \sum_{j=1}^{n-a} (y_{ij} - \bar y_{..})(y_{i,j+a} - \bar y_{..}) \\
&= 2 \sum_{a=1}^{n-1} k (n - a)\, \rho_a\, S^2\, \frac{kn - 1}{kn}
\end{aligned}
$$

where $\rho_a$ is defined as

$$\rho_a = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n-a} (y_{ij} - \bar y_{..})(y_{i,j+a} - \bar y_{..})}{k (n - a)} \cdot \frac{kn}{(kn - 1)\, S^2} \tag{13}$$

and is called the non-circular serial correlation coefficient for
lag ka. Hence, we have

$$\rho = \frac{2}{n (n - 1)} \sum_{a=1}^{n-1} (n - a)\, \rho_a \tag{14}$$

Substituting for $\rho$ from (14) in (11), we get

$$V(\bar y_{i.})_{sy} = \frac{kn - 1}{kn} \cdot \frac{S^2}{n} \left\{1 + \frac{2}{n} \sum_{a=1}^{n-1} (n - a)\, \rho_a\right\} \tag{15}$$

The expression is due to the Madows (1944).

9.4 Comparison of Systematic with Stratified Random Sampling 


We shall consider the population as divided into n strata
corresponding to the rows of the schematic diagram (1), and
suppose that one unit is randomly drawn from each one of these
strata, thus giving us a stratified sample of n. Clearly, the variance
of the mean of this sample will be

$$V(\bar y_n)_{st} = \frac{1}{n} \left(1 - \frac{1}{k}\right) S_w^2 \tag{16}$$

where $S_w^2$ is the mean square between units within rows,
defined by

$$S_w^2 = \frac{1}{n (k - 1)} \sum_{j=1}^{n} \sum_{i=1}^{k} (y_{ij} - \bar y_{.j})^2 \tag{17}$$

To examine how this compares with the variance of a systematic
sample we shall first express the latter in a form suitable for
direct comparison with (16).


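Equation (5) can be written as

$$
\begin{aligned}
V(\bar y_{i.})_{sy} &= \frac{1}{k} \sum_{i=1}^{k} \left\{\frac{1}{n} \sum_{j=1}^{n} y_{ij} - \frac{1}{n} \sum_{j=1}^{n} \bar y_{.j}\right\}^2 \\
&= \frac{1}{k n^2} \sum_{i=1}^{k} \left\{\sum_{j=1}^{n} (y_{ij} - \bar y_{.j})\right\}^2 \\
&= \frac{1}{k n^2} \left[\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar y_{.j})^2 + \sum_{i=1}^{k} \sum_{j \ne j'}^{n} (y_{ij} - \bar y_{.j})(y_{ij'} - \bar y_{.j'})\right]
\end{aligned}
\tag{18}
$$

The second term contains kn(n - 1) product terms, of which
2k(n - 1) are products of y deviations from the respective strata
means separated by one row, 2k(n - 2) are products of y
deviations separated by two rows, ..., 2k(n - a) are products
of y deviations separated by a rows, ..., and 2k separated by
(n - 1) rows. We can, therefore, write (18) as

$$
\begin{aligned}
V(\bar y_{i.})_{sy} &= \frac{1}{k n^2} \left[\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar y_{.j})^2 + 2 \sum_{i=1}^{k} \sum_{a=1}^{n-1} \sum_{j=1}^{n-a} (y_{ij} - \bar y_{.j})(y_{i,j+a} - \bar y_{.,j+a})\right] \\
&= \frac{1}{k n^2} \left[\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar y_{.j})^2 + 2 \sum_{a=1}^{n-1} k (n - a)\, \rho_{(a)w}\, \frac{\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar y_{.j})^2}{n (k - 1)}\right]
\end{aligned}
\tag{19}
$$

where

$$\rho_{(a)w} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n-a} (y_{ij} - \bar y_{.j})(y_{i,j+a} - \bar y_{.,j+a}) \Big/ k (n - a)}{\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar y_{.j})^2 \Big/ n (k - 1)} \tag{20}$$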
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and is termed the within stratum serial correlation between 
observations separated by distance ka. Hence, substituting from 
(17) in terms of 5,,,2, we obtain 


А k=1 s k 2 br 
V (ӯ, )sy =i k s Я [ T ET 3 (n — a) rw (21) 
а=1 


The expression in this form was first derived by the Madows 
(1944). Comparing it with (16), we note that the relative efficiency 
of the systematic and stratified random samples depends on the 
values of $\rho_{(a)wst}$ in the population, and no general conclusions can, 
therefore, be drawn. If the $\rho_{(a)wst}$ are all positive, the stratified 
sample will be superior to the systematic sample; if the $\rho_{(a)wst}$ 
are zero, the two samples will provide estimates of equal 
precision. 
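
The relation between (16), (20) and (21) can be checked numerically on any small population. The sketch below is only an illustration under stated assumptions: the function name `systematic_vs_stratified` and the test population are hypothetical, and the units are assumed to be listed in serial order so that consecutive blocks of k units form the rows (strata) of diagram (1).

```python
def systematic_vs_stratified(y, n, k):
    """Return (V_st, V_sy computed directly, V_sy from equation (21)).

    y is the population of n*k values in serial order; row j holds units
    j*k .. j*k + k - 1, and column i is the i-th systematic sample.
    """
    rows = [y[j * k:(j + 1) * k] for j in range(n)]
    row_means = [sum(r) / k for r in rows]
    # deviations from the row (stratum) means
    dev = [[v - row_means[j] for v in rows[j]] for j in range(n)]
    s2_wst = sum(d * d for r in dev for d in r) / (n * (k - 1))   # equation (17)
    v_st = (k - 1) / (n * k) * s2_wst                              # equation (16)

    grand = sum(y) / (n * k)
    col_means = [sum(rows[j][i] for j in range(n)) / n for i in range(k)]
    v_sy_direct = sum((c - grand) ** 2 for c in col_means) / k     # equation (5)

    weighted = 0.0
    for a in range(1, n):
        cross = sum(dev[j][i] * dev[j + a][i]
                    for i in range(k) for j in range(n - a))
        rho_a = cross / (k * (n - a)) / s2_wst                     # equation (20)
        weighted += (n - a) * rho_a
    v_sy_formula = v_st * (1 + 2 * k / (n * (k - 1)) * weighted)   # equation (21)
    return v_st, v_sy_direct, v_sy_formula
```

For any population the second and third returned values should agree, since (21) is merely a rearrangement of (5).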


9.5 Comparison of Systematic with Simple and Stratified Random 
Samples for Certain Specified Populations 


(i) Linear Trend 


Let us suppose that the values of the successive units of 
the population increase in accordance with a linear law, so 
that 


$$y_h = \mu + h \tag{22}$$


where $\mu$ is a constant and h goes from 1 to N. 


Clearly, 

$$\bar{y} = \frac{1}{N}\sum_{h=1}^{N}(\mu + h) = \mu + \frac{N+1}{2} \tag{23}$$


$$\sum_{h=1}^{N} y_h^2 = N\mu^2 + \frac{N(N+1)(2N+1)}{6} + \mu N(N+1) \tag{24}$$

and hence 


$$S^2 = \frac{1}{N-1}\Big\{\sum_{h=1}^{N} y_h^2 - N\bar{y}^2\Big\} = \frac{nk(nk+1)}{12} \tag{25}$$


Similarly, since the observations within each row increase by unity, 
we have 


$$S_{wst}^2 = \frac{k(k+1)}{12} \tag{26}$$


and for the same reason, since the column means corresponding 
to k different systematic samples also increase by unity, we have 
the mean square between column means given by 


$$S_c^2 = \frac{k(k+1)}{12} \tag{27}$$


Substituting from (27) in (6), from (25) in (7), and from (26) in 
(16), we get for the variances of the systematic, random and 
stratified sample means respectively 


$$V_{sy} = \frac{k^2 - 1}{12} \tag{28}$$

$$V_{ran} = \frac{(k-1)(nk+1)}{12} \tag{29}$$

and 

$$V_{st} = \frac{k^2 - 1}{12n} \tag{30}$$

Hence 

$$V_{st} : V_{sy} : V_{ran} = \frac{k+1}{n} : (k+1) : (nk+1) \tag{31}$$

or, approximately 

$$\frac{1}{n} : 1 : n \tag{32}$$

We notice that the variance of a stratified sample is only 
1/n-th the variance of a systematic sample, and the latter in its 
turn is also approximately 1/n-th the variance of a random sample. 
Stratified sampling is thus seen to be the most efficient of the 
three methods for removing the effects of a linear trend, with 
systematic sampling following it as the next best method. The 
reader may like to verify that $\rho_{(a)wst}$ is (k-1)/k for all values of 
a, thus explaining the loss of efficiency of systematic sampling 
compared to stratified random sampling. He will also find 
that $\rho = -(k^2 n + 1)/(k^2 n^2 - 1)$, or approximately -1/n, which 
accounts for the superiority of systematic sampling over random 
sampling. 
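
The closed forms (28) and (29) and the value of the intra-class correlation quoted above can be verified by direct enumeration. The following is a minimal sketch; the function name `linear_trend_check` is hypothetical, and the constant $\mu$ is taken as zero on the assumption that a shift of origin leaves all the variances unchanged.

```python
def linear_trend_check(n, k):
    """Enumerate the k systematic samples of y_h = h, h = 1..nk."""
    N = n * k
    y = list(range(1, N + 1))                      # linear trend with mu = 0
    ybar = sum(y) / N
    s2 = sum((v - ybar) ** 2 for v in y) / (N - 1)
    samples = [y[i::k] for i in range(k)]          # every k-th unit from start i
    v_sy = sum((sum(s) / n - ybar) ** 2 for s in samples) / k
    v_ran = (1 / n - 1 / N) * s2
    # sum of products of deviations over pairs of distinct units in a sample
    cross = sum((a - ybar) * (b - ybar) for s in samples for a in s for b in s)
    cross -= sum((v - ybar) ** 2 for v in y)       # drop the terms with a == b
    rho = cross / ((n - 1) * (N - 1) * s2)         # intra-class correlation
    return v_sy, v_ran, rho
```

With n = 5 and k = 4 the three returned values should equal $(k^2-1)/12$, $(k-1)(nk+1)/12$ and $-(k^2n+1)/(k^2n^2-1)$ respectively.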


(ii) Periodic Variation 


We shall now consider populations in which sampling units 
with high and low values follow one another according to a regular 
pattern. Suppose such a population is represented by 


$$y_h = \sin\{a + (h - 1)\,\pi/10\}$$


where h varies from 1 to an integral multiple of 20. Clearly, the 
successive sampling units will repeat themselves after every 20th 
value. A systematic five per cent. sample from such a population 
will consist of sampling units drawn from the same position of 
each cycle, giving an estimate which is no more accurate than 
a single value. A five per cent. random sample, on the other 
hand, will contain units from different parts of the cycles with 
the result that the means of such samples will vary within a 
narrower range than the means of different systematic samples, thus 
making random samples more efficient than systematic samples for 
removing the effect of a periodic trend. At the other extreme, if 
we select a ten per cent. systematic sample with two regularly 
spaced observations from each cycle, the first selected randomly 
out of the first 10 and the second chosen at a distance of 10 units 
from the first, then the mean based thereon will be identical 
with the population mean, thus making systematic sampling the 
most efficient of all sampling methods. It will be noticed, there- 
fore, that the relative efficiency of systematic sampling for popula- 
tions showing periodic variation depends upon the choice of the 
interval between the successive units sought to be included in the 
sample. In particular, if the interval coincides with the period of 
the cycle, the sample will contain units which are all alike, giving 
$\rho = 1$, and, consequently, the relative increase in variance of 
systematic over random sampling is maximum. Of course, in 
nature, regular periodicity is most unlikely to occur but the 
example serves to illustrate how the effectiveness of a systematic 
sample is influenced by the interval in sampling a population 
exhibiting a periodic trend. 
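
The two extreme cases described above can be reproduced on a small periodic population. This is only a sketch: the phase `a = 0.3`, the population of ten cycles and the helper name `systematic_means` are assumptions made for illustration.

```python
import math

def systematic_means(y, k):
    """Means of the k possible systematic samples taken at interval k."""
    n = len(y) // k
    return [sum(y[i::k]) / n for i in range(k)]

a = 0.3                                            # arbitrary phase (assumed)
# ten complete cycles of period 20
y = [math.sin(a + (h - 1) * math.pi / 10) for h in range(1, 201)]
pop_mean = sum(y) / len(y)

means_5pct = systematic_means(y, 20)   # interval equal to the period
means_10pct = systematic_means(y, 10)  # interval equal to half the period
```

Each entry of `means_5pct` coincides with a single population value, so the five per cent. sample is no better than one observation, while every entry of `means_10pct` coincides with the population mean.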


(iii) Natural Populations 


Systematic sampling is found to be both efficient and con- 
venient in sampling certain natural populations like forest areas 
for estimating the volume of timber (Hasel, 1942; Griffith, 1945— 
46) and areas under different types of cover (Osborne, 1942). 
We shall illustrate here its efficiency for sampling a certain natural 
population distributed in time. 


Example 9.1 


A pilot survey for investigating the possibility of estimating the 
catch of marine fish was conducted in a sample of landing 
centres on the Malabar coast of India (I.C.A.R., 1950). At 
each landing centre in the sample, a count was made of the 
number of boats landing every hour from 6 A.M. to 6 P.M. 
Out of the boats landing during each hour, the first one was 
selected for observation on weight of fish, the product with the 
number of landing boats giving an estimate of the catch brought 
during the hour. Table 9.1 shows the number of boats landing 
per hour at Quilandy Centre for seven consecutive Mondays. 
Calculate for each day the values of $\rho$ between observations 
separated by k = 2, 3, 4 and 6 hours and hence investigate the 
effectiveness of systematic sampling relative to random sampling 
for making observations during the day. 


The first step in the calculation consists in making the analysis 
of variance tables on the number of boats for 
each of the several cases (k = 2, n = 6; k = 3, n = 4; k = 4, 
n = 3; and k = 6, n = 2). We then substitute the mean 
squares between ($nS_c^2$) and within systematic samples ($S_{wc}^2$) in 
the expression for $\rho$, namely, 


$$\rho = \frac{(k-1)\,nS_c^2 - k\,S_{wc}^2}{(k-1)\,nS_c^2 + k(n-1)\,S_{wc}^2}$$


And finally we calculate the values of the variance of systematic 
relative to random sampling from (12). The calculations are 
illustrated in Table 9.2 with reference to data collected on the 
third Monday. 


Table 9.3 presents the values of $\rho$ and those of the variance 
of systematic relative to random sampling for all cases. It will 
be seen that $\rho$ is negative for all except three cases, and smaller 
than -1/11, thus showing the superiority of systematic over 
random sampling for making observations. Further, it will be 
seen that the superiority generally improves with the size of the 
systematic sample. 
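
The arithmetic of Table 9.2 can be retraced from the two mean squares alone. The sketch below is an illustration, not the author's computing routine; the function name `rho_and_relative_variance` is hypothetical, and the relative variance is obtained as the ratio of the between mean square to the total mean square, which is algebraically the same as the ratio of the systematic to the random sampling variance.

```python
def rho_and_relative_variance(ms_between, ms_within, n, k):
    """Intra-class correlation and V_sy / V_ran from an analysis of variance.

    ms_between : mean square between the k systematic samples (k - 1 d.f.)
    ms_within  : mean square within systematic samples (k(n - 1) d.f.)
    """
    ss_between = (k - 1) * ms_between
    ss_within = k * (n - 1) * ms_within
    ss_total = ss_between + ss_within
    rho = (ss_between - ss_within / (n - 1)) / ss_total
    total_ms = ss_total / (n * k - 1)
    relative_variance = ms_between / total_ms
    return rho, relative_variance

# third Monday, k = 6, n = 2 (mean squares taken from Table 9.2)
rho_6, rel_6 = rho_and_relative_variance(47.08, 136.58, n=2, k=6)
```

With the mean squares of Table 9.2 this reproduces, to the rounding of the table, the values of $\rho$ and of the relative variance shown there.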


TABLE 9.1 


Number of Boats Landing during Each of 12 Hours 
(6 a.m. to 6 p.m.) on Seven Consecutive Mondays 


                              Hour of Day 

Week   6-7  7-8  8-9  9-10  10-11  11-12  12-1  1-2  2-3  3-4  4-5  5-6 
1 42 5> 19 6 23 56 36 59 14 14 #2 6 
2 28 39 33 27 9 82: 39 45 9 6 ebi D 
3 Sr 25 18 6 Б 9 w 2 9 Bi 10 07 
4 4L ъъ 19 М0 27 215350 45 5: 
5 8 16 м 8 47 39 20 26 33° 34 61 45 
6 2 6 20 28 28 м оя 4 37 20. 51 
7 16 15 4 м 12 4 45 10 17 10 МЕ 16 


TABLE 9.2 


Analysis of Variance Table together with the Values of ρ and of 
the Variance of Systematic Relative to Random Sampling for the 
Data Collected on the Third Monday 


                        k=6, n=2        k=4, n=3        k=3, n=4        k=2, n=6 
Source of Variation     D.F.  Mean      D.F.  Mean      D.F.  Mean      D.F.  Mean 
                              Square          Square          Square          Square 

Between (nS_c^2)          5   47.08       3   52.97       2   70.08       1    0.75 
Within (S_wc^2)           6  136.58       8  112.00       9  101.64      10  105.42 
Total                    11   95.90      11   95.90      11   95.90      11   95.90 

ρ                           -0.55           -0.27           -0.16           -0.199 
V_sy / V_ran (%)              49              55              73               1 


TABLE 9.3 


Values of the Intra-Class Coefficient of Correlation and of the Variance 
of Systematic Relative to Random Sampling 


              Values of ρ                      V_sy / V_ran (%) 

Р 
3 4 6 2 3 4 6 
=25 —<26 — > 147 61 30 63 
2 = йб слу – ле 7 33 63 37 
3 ашы; 2276, 16 -2.105 49 55 73 1 
4 0 = = qo 78 51 8 12 
5 35510-48 4.04 246 166 5 154 7 
é 0 олы 2 ton 23. 55 7 : 
F LL -46. d 46 бтп Bb3 


9.6 Estimation of the Variance 


Since a systematic sample is a random sample of one cluster 
only, no estimate of the variance can be formed from the sample. 
This is a great handicap of the method which otherwise offers 
great advantages. 


Certain approximations are used in practice to calculate the 
variance. One of these consists in treating the systematic sample 
as if it were a random sample of n units and calculating the 
variance, using the formula 


$$\left(\frac{1}{n} - \frac{1}{nk}\right) s_{wc}^2 \tag{33}$$


where $s_{wc}^2$ is the mean square between units within the selected 
systematic sample, i.e., a column of diagram (1). Clearly, (33) 
does not provide an unbiased estimate of (6), for, 


$$
\begin{aligned}
E(s_{wc}^2) &= \frac{1}{k}\sum_{i=1}^{k}\Big\{\frac{1}{n-1}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i.})^2\Big\} \\
&= \frac{1}{k(n-1)}\Big[\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{..})^2 - n\sum_{i=1}^{k}(\bar{y}_{i.} - \bar{y}_{..})^2\Big]
\end{aligned}
$$


or, substituting from (11) in the right-hand side, 


$$
\begin{aligned}
E(s_{wc}^2) &= \frac{1}{k(n-1)}\Big[(nk-1)S^2 - \frac{nk-1}{n}\,S^2\{1 + (n-1)\rho\}\Big] \\
&= \frac{nk-1}{nk}\,S^2(1 - \rho)
\end{aligned}
\tag{34}
$$

Thus the expected value of (33) is given by 

$$\left(\frac{1}{n} - \frac{1}{nk}\right)\frac{nk-1}{nk}\,S^2(1 - \rho)$$


which clearly is different from the variance of the systematic 
sample given by (11), unless $\rho = -1/(nk-1)$, that is, unless the 
systematic sample behaves as a random sample of the population. 
It is only when the units in the population are randomly ordered 
that we may expect (33) to provide a satisfactory estimate of the 
variance of a systematic sample. If they are not randomly 
ordered, as will be the case in natural populations distributed in 
time and space, the effect of using (33) to estimate the variance 
will obviously be to under-estimate the variance on an average, 
if the intra-class correlation among units of the systematic sample 
is larger than -1/(nk-1), and vice versa. 


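Another approximation sometimes used for calculating the 
variance from the sample is 


$$\left(\frac{1}{n} - \frac{1}{nk}\right)\frac{\sum_{j=1}^{n-1}(y_{ij} - y_{i,j+1})^2}{2(n-1)} \tag{35}$$


This again will be a biased estimate, for we may write 


$$
\begin{aligned}
E\Big[\sum_{j=1}^{n-1}(y_{ij} - y_{i,j+1})^2\Big]
&= E\Big[\sum_{j=1}^{n-1}\{(y_{ij} - \bar{y}_{.j}) - (y_{i,j+1} - \bar{y}_{.,j+1}) + (\bar{y}_{.j} - \bar{y}_{.,j+1})\}^2\Big] \\
&= E\Big[\sum_{j=1}^{n-1}\{(y_{ij} - \bar{y}_{.j})^2 + (y_{i,j+1} - \bar{y}_{.,j+1})^2 + (\bar{y}_{.j} - \bar{y}_{.,j+1})^2 \\
&\qquad - 2(y_{ij} - \bar{y}_{.j})(y_{i,j+1} - \bar{y}_{.,j+1}) \\
&\qquad + 2(y_{ij} - \bar{y}_{.j})(\bar{y}_{.j} - \bar{y}_{.,j+1}) \\
&\qquad - 2(y_{i,j+1} - \bar{y}_{.,j+1})(\bar{y}_{.j} - \bar{y}_{.,j+1})\}\Big]
\end{aligned}
\tag{36}
$$

Clearly, the value of the first and the second term is equal to 
$(n-1)S_{wst}^2$ each, that of the fourth is equal to $-2(n-1)S_{wst}^2\,\rho_{(1)wst}$, 
and that of the fifth and sixth is zero each. Hence we may write 


$$
E\Big[\left(\frac{1}{n} - \frac{1}{nk}\right)\frac{\sum_{j=1}^{n-1}(y_{ij} - y_{i,j+1})^2}{2(n-1)}\Big]
= \frac{k-1}{nk}\,S_{wst}^2\,(1 - \rho_{(1)wst})
+ \frac{k-1}{2kn(n-1)}\sum_{j=1}^{n-1}(\bar{y}_{.j} - \bar{y}_{.,j+1})^2
\tag{37}
$$


or, substituting from (21), 


$$
= V(\bar{y}_{i.})_{sy}
- \frac{k-1}{nk}\,S_{wst}^2\Big\{\frac{2k}{n(k-1)}\sum_{a=1}^{n-1}(n-a)\,\rho_{(a)wst} + \rho_{(1)wst}\Big\}
+ \frac{k-1}{2kn(n-1)}\sum_{j=1}^{n-1}(\bar{y}_{.j} - \bar{y}_{.,j+1})^2
\tag{38}
$$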


It is difficult to put an easy interpretation on (38). We may say 
that if the differences between neighbouring rows are counter- 
balanced by the within stratum serial correlation, (35) may serve 
to give a fair idea of the variance of a systematic sample. 
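
The behaviour of the two approximations (33) and (35) can be examined by averaging each over all k possible systematic samples and comparing the result with the true variance. The sketch below assumes a population given in serial order; the function name `variance_estimators` is hypothetical.

```python
def variance_estimators(y, n, k):
    """Return (true V_sy, expected value of (33), expected value of (35))."""
    N = n * k
    ybar = sum(y) / N
    factor = 1 / n - 1 / N                         # the multiplier (1/n - 1/nk)
    samples = [y[i::k] for i in range(k)]          # the k systematic samples
    v_sy = sum((sum(s) / n - ybar) ** 2 for s in samples) / k
    est33, est35 = [], []
    for s in samples:
        sbar = sum(s) / n
        s2_wc = sum((v - sbar) ** 2 for v in s) / (n - 1)
        est33.append(factor * s2_wc)               # formula (33)
        diffs = sum((s[j] - s[j + 1]) ** 2 for j in range(n - 1))
        est35.append(factor * diffs / (2 * (n - 1)))   # formula (35)
    return v_sy, sum(est33) / k, sum(est35) / k

# linear trend y_h = h with n = 5, k = 4
v_true, e33, e35 = variance_estimators(list(range(1, 21)), 5, 4)
```

For the linear trend the intra-class correlation is below -1/(nk-1), so (33) over-estimates the variance, in agreement with the remark following (34), whereas (35) comes much closer to the true value.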


9.7 Two-Stage Sample: Equal Units: Systematic Sampling of 
Second-Stage Units 


The method of systematic sampling can be used for the selec- 
tion of a two-stage sample at either of the two stages or both. 
Of these schemes, of more interest is the one with systematic 
selection at the second stage, partly because this permits the 
estimation of the sampling error, and also because it enables the 
selection of the second-stage units to be entrusted to the field 
staff without great risk of errors and facilitates control of field 
work. The other two schemes do not possess the first advantage 
and for this reason will not be considered in the book, although 
their theory is straightforward. 


We shall suppose that the population contains NM units, 
grouped into N first-stage units of M second-stage units each, and 
consider a scheme of sampling in which a sample of n first-stage 
units is selected with equal probabilities, and m second-stage units 
are selected from each one of the n first-stage units by the method 
of systematic sampling. 


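Following the previous notation for two-stage sampling, let $\bar{y}_{nm}$ 
denote the sample mean and $\bar{y}_{..}$ the population mean, defined by 


$$\bar{y}_{nm} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij}$$

$$\bar{y}_{..} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}$$


Clearly, $\bar{y}_{nm}$ provides an unbiased estimate of the population mean 
$\bar{y}_{..}$. For, 


$$
E(\bar{y}_{nm}) = \frac{1}{n}E\Big[\sum_{i=1}^{n} E(\bar{y}_{i(m)} \mid i)\Big]
= \frac{1}{n}E\Big[\sum_{i=1}^{n}\bar{y}_{i.}\Big] = \bar{y}_{..}
\tag{39}
$$

To obtain the variance of $\bar{y}_{nm}$, we write 

$$
\begin{aligned}
V(\bar{y}_{nm}) &= E(\bar{y}_{nm} - \bar{y}_{..})^2 \\
&= E\{(\bar{y}_{nm} - \bar{y}_{n.}) + (\bar{y}_{n.} - \bar{y}_{..})\}^2 \\
&= E(\bar{y}_{nm} - \bar{y}_{n.})^2 + E(\bar{y}_{n.} - \bar{y}_{..})^2
 + 2E(\bar{y}_{nm} - \bar{y}_{n.})(\bar{y}_{n.} - \bar{y}_{..})
\end{aligned}
\tag{40}
$$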

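Consider the first term in (40). We have 


$$
\begin{aligned}
E(\bar{y}_{nm} - \bar{y}_{n.})^2 &= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}(\bar{y}_{i(m)} - \bar{y}_{i.})\Big]^2 \\
&= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}(\bar{y}_{i(m)} - \bar{y}_{i.})^2
 + \sum_{i \neq i'}(\bar{y}_{i(m)} - \bar{y}_{i.})(\bar{y}_{i'(m)} - \bar{y}_{i'.})\Big] \\
&= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}E\{(\bar{y}_{i(m)} - \bar{y}_{i.})^2 \mid i\}
 + \sum_{i \neq i'}E\{(\bar{y}_{i(m)} - \bar{y}_{i.}) \mid i\}\,E\{(\bar{y}_{i'(m)} - \bar{y}_{i'.}) \mid i'\}\Big]
\end{aligned}
\tag{41}
$$


The value of the second term of (41) is clearly zero, since samples 
are independently drawn from the i-th and i'-th first-stage units. 
The value of the first term of (41) is derived from (11). We 
have 

$$
E\{(\bar{y}_{i(m)} - \bar{y}_{i.})^2 \mid i\} = \left(1 - \frac{1}{M}\right)\frac{S_i^2}{m}\{1 + \rho_i(m-1)\}
\tag{42}
$$

where $S_i^2$ denotes the mean square between the second-stage units 
of the i-th selected first-stage unit, and $\rho_i$ denotes the intra-class 
correlation between second-stage units within M/m columns of 
m units each which can be formed out of the M units of the 
i-th selected first-stage unit. 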


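Substituting from (42) in (41), we then obtain 


$$
\begin{aligned}
E(\bar{y}_{nm} - \bar{y}_{n.})^2 &= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}\left(1 - \frac{1}{M}\right)\frac{S_i^2}{m}\{1 + \rho_i(m-1)\}\Big] \\
&= \frac{1}{nmN}\left(1 - \frac{1}{M}\right)\sum_{i=1}^{N}S_i^2\{1 + \rho_i(m-1)\}
\end{aligned}
\tag{43}
$$


The value of the second term in (40) is known to be given by 


$$\left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 \tag{44}$$

while the last term is obviously zero. Interchanging the first and 
the second terms in (40), and substituting from (43) and (44), we 
thus obtain 


$$
V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2
+ \frac{1}{nmN}\left(1 - \frac{1}{M}\right)\sum_{i=1}^{N}S_i^2\{1 + \rho_i(m-1)\}
\tag{45}
$$


If $S_i^2$ is of the same order for all i, say equal to $S_w^2$, as is likely 
to be the case when the first-stage units are equal in size, we get 


$$
V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2
+ \frac{1}{nm}\left(1 - \frac{1}{M}\right)S_w^2\{1 + \bar{\rho}(m-1)\}
\tag{46}
$$

where 

$$\bar{\rho} = \frac{1}{N}\sum_{i=1}^{N}\rho_i$$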

Had the first-stage units been selected with replacement, equation 
(46) would be further simplified, giving 

$$
V(\bar{y}_{nm}) = \frac{S_b^2}{n} + \frac{1}{nm}\left(1 - \frac{1}{M}\right)S_w^2\{1 + \bar{\rho}(m-1)\}
\tag{47}
$$


If $\bar{\rho} = -1/(M-1)$, the method of systematic sampling for the 
selection of second-stage units will be equivalent to a method of 
random sampling, and we shall be left with the familiar expression 
for the variance of the mean of a two-stage random sample, 
namely, 

$$
V(\bar{y}_{nm}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 + \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)S_w^2
\tag{48}
$$


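To obtain the estimate of the variance, we consider the mean 
square between first-stage unit means in the sample, $s_b^2$, defined by 


$$s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_{i(m)} - \bar{y}_{nm})^2 \tag{49}$$


Taking expectations of both sides in (49), we may write 


$$(n-1)\,E(s_b^2) = E\Big(\sum_{i=1}^{n}\bar{y}_{i(m)}^2\Big) - nE(\bar{y}_{nm}^2) \tag{50}$$


Now to evaluate the first term in (50), we write 


$$
\begin{aligned}
E\Big(\sum_{i=1}^{n}\bar{y}_{i(m)}^2\Big) &= E\Big[\sum_{i=1}^{n}E(\bar{y}_{i(m)}^2 \mid i)\Big] \\
&= E\Big[\sum_{i=1}^{n}\Big\{\bar{y}_{i.}^2 + \left(1 - \frac{1}{M}\right)\frac{S_i^2}{m}\{1 + \rho_i(m-1)\}\Big\}\Big] \\
&= \frac{n}{N}\sum_{i=1}^{N}\bar{y}_{i.}^2 + \frac{n}{mN}\left(1 - \frac{1}{M}\right)\sum_{i=1}^{N}S_i^2\{1 + \rho_i(m-1)\}
\end{aligned}
\tag{51}
$$


The value of the second term in (50) is directly obtained from 
(45), for, 


$$
\begin{aligned}
nE(\bar{y}_{nm}^2) &= n\bar{y}_{..}^2 + nV(\bar{y}_{nm}) \\
&= n\bar{y}_{..}^2 + \left(1 - \frac{n}{N}\right)S_b^2
 + \frac{1}{mN}\left(1 - \frac{1}{M}\right)\sum_{i=1}^{N}S_i^2\{1 + \rho_i(m-1)\}
\end{aligned}
\tag{52}
$$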


Substituting from (51) and (52) in (50) and dividing by (n - 1), 
we get 


$$
E(s_b^2) = S_b^2 + \frac{1}{mN}\left(1 - \frac{1}{M}\right)\sum_{i=1}^{N}S_i^2\{1 + \rho_i(m-1)\}
\tag{53}
$$


We may, therefore, write from (45) 

$$\text{Est. } V(\bar{y}_{nm}) = \frac{s_b^2}{n} \tag{54}$$


neglecting $S_b^2/N$. In other words, $s_b^2/n$ supplies us with an upper 
bound for the estimate of the variance of the sample mean, which 
should be satisfactory for all practical purposes, particularly when 
N is large. When the first-stage units are selected with replacement, 
$s_b^2/n$ gives an unbiased estimate of the sampling variance. 
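
Formula (45) and the statement that $s_b^2/n$ exceeds the variance on the average by $S_b^2/N$ can both be checked by enumerating every possible sample from a small population. The sketch below is an illustration under stated assumptions: first-stage units are drawn without replacement with equal probabilities, M is a multiple of m, m is at least 2, and the name `two_stage_systematic_check` and the test data are hypothetical.

```python
from itertools import combinations, product

def two_stage_systematic_check(units, n, m):
    """units: list of N lists, each holding M values in serial order."""
    N, M = len(units), len(units[0])
    k = M // m                                     # interval within a unit
    unit_means = [sum(u) / M for u in units]
    grand = sum(unit_means) / N
    s2_b = sum((x - grand) ** 2 for x in unit_means) / (N - 1)

    # the k systematic sample means available inside each first-stage unit
    sys_means = [[sum(u[start::k]) / m for start in range(k)] for u in units]

    # sum over i of S_i^2 {1 + rho_i (m - 1)}, as it appears in (45)
    within_sum = 0.0
    for u, ubar in zip(units, unit_means):
        s2_i = sum((v - ubar) ** 2 for v in u) / (M - 1)
        cross = 0.0
        for start in range(k):
            dev = [v - ubar for v in u[start::k]]
            cross += sum(dev) ** 2 - sum(d * d for d in dev)
        rho_i = cross / ((m - 1) * (M - 1) * s2_i)   # intra-class correlation
        within_sum += s2_i * (1 + rho_i * (m - 1))
    v_formula = (1 / n - 1 / N) * s2_b + (1 - 1 / M) * within_sum / (n * m * N)

    # exact enumeration of all equally likely samples
    estimates, s2_values = [], []
    for chosen in combinations(range(N), n):
        for starts in product(range(k), repeat=n):
            means = [sys_means[i][s] for i, s in zip(chosen, starts)]
            est = sum(means) / n
            estimates.append(est)
            s2_values.append(sum((x - est) ** 2 for x in means) / (n - 1))
    v_exact = sum((e - grand) ** 2 for e in estimates) / len(estimates)
    expected_estimate = sum(s2_values) / len(s2_values) / n
    return v_exact, v_formula, expected_estimate, s2_b / N
```

The first two returned values should coincide, and the third should exceed the first by exactly the fourth.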


9.8 Two-Stage Sample: Unequal Units: Systematic Sampling of 
Second-Stage Units 


In this section we shall extend the previous results to the case 
of unequal units. We shall suppose that the population consists 
of N first-stage units with the i-th unit containing $M_i$ second- 
stage units (i = 1, 2, ..., N) and further consider a scheme of 
sampling in which the first-stage units are selected with replacement 
with varying probabilities $P_1, P_2, ..., P_N$ and the second-stage 
units within the selected first-stage units are selected by the 
method of systematic sampling. Let n denote the number of 
first-stage units to be selected in the sample and $m_i$ the number 
of second-stage units to be selected from the i-th first-stage unit 
if included in the sample. Further, for simplicity, let $m_i$ be so 
determined that $M_i/m_i$ is an integer. 

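Following the treatment in Section 8.2, let 


$$z_{ij} = \frac{M_i\,y_{ij}}{M_0\,P_i} \tag{55}$$


and consider the estimate 


$$\bar{z}_s = \frac{1}{n}\sum_{i=1}^{n}\bar{z}_{i(m_i)} \tag{56}$$


Clearly, $\bar{z}_s$ provides an unbiased estimate of the population mean 
$\bar{y}_{..}$. For, 

$$
E(\bar{z}_s) = \frac{1}{n}E\Big[\sum_{i=1}^{n}E(\bar{z}_{i(m_i)} \mid i)\Big]
= \frac{1}{n}E\Big[\sum_{i=1}^{n}\bar{z}_{i.}\Big]
$$

in virtue of (4), 

$$= \bar{z}_{..} = \bar{y}_{..} \tag{57}$$

To obtain the sampling variance of $\bar{z}_s$, we write 

$$
\begin{aligned}
V(\bar{z}_s) &= E(\bar{z}_s - \bar{z}_{..})^2 \\
&= E\{(\bar{z}_s - \bar{z}_{n.}) + (\bar{z}_{n.} - \bar{z}_{..})\}^2 \\
&= E(\bar{z}_s - \bar{z}_{n.})^2 + E(\bar{z}_{n.} - \bar{z}_{..})^2
 + 2E\{(\bar{z}_s - \bar{z}_{n.})(\bar{z}_{n.} - \bar{z}_{..})\}
\end{aligned}
\tag{58}
$$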
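Taking the first term in (58), we have 


$$
\begin{aligned}
E(\bar{z}_s - \bar{z}_{n.})^2 &= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}(\bar{z}_{i(m_i)} - \bar{z}_{i.})\Big]^2 \\
&= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}(\bar{z}_{i(m_i)} - \bar{z}_{i.})^2
 + \sum_{i \neq i'}(\bar{z}_{i(m_i)} - \bar{z}_{i.})(\bar{z}_{i'(m_{i'})} - \bar{z}_{i'.})\Big] \\
&= \frac{1}{n^2}E\Big[\sum_{i=1}^{n}E\{(\bar{z}_{i(m_i)} - \bar{z}_{i.})^2 \mid i\}
 + \sum_{i \neq i'}E\{(\bar{z}_{i(m_i)} - \bar{z}_{i.}) \mid i\}\,E\{(\bar{z}_{i'(m_{i'})} - \bar{z}_{i'.}) \mid i'\}\Big]
\end{aligned}
$$


The expression under the summation sign in the first term repre- 
sents the variance of the mean of a systematic sample of $m_i$ 
selected out of $M_i$ and can be written from (42). The value of 
the second term is clearly zero. We, therefore, write 


$$
E(\bar{z}_s - \bar{z}_{n.})^2 = \frac{1}{n}\sum_{i=1}^{N}P_i\left(1 - \frac{1}{M_i}\right)\frac{S_{iz}^2}{m_i}\{1 + \rho_i(m_i - 1)\}
$$

where $S_{iz}^2$ denotes the mean square between the values $z_{ij}$ within the 
i-th first-stage unit. The value of the second term in (58) is known to be $\sigma_{bz}^2/n$, 
where 

$$\sigma_{bz}^2 = \sum_{i=1}^{N}P_i(\bar{z}_{i.} - \bar{z}_{..})^2$$


The last term in (58) is obviously zero. Hence we have 


$$
V(\bar{z}_s) = \frac{\sigma_{bz}^2}{n} + \frac{1}{n}\sum_{i=1}^{N}P_i\left(1 - \frac{1}{M_i}\right)\frac{S_{iz}^2}{m_i}\{1 + \rho_i(m_i - 1)\}
\tag{59}
$$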


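To obtain an estimate of $V(\bar{z}_s)$, we consider the mean square 
between the first-stage unit means in the sample, namely, 


$$s_{bz}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{z}_{i(m_i)} - \bar{z}_s)^2 \tag{60}$$


Expanding and taking expectations, we write from (17) of 
Section 8.3, 


$$
E(s_{bz}^2) = \frac{1}{n-1}\Big[n\{V(\bar{z}_{i(m_i)})_{sy} + \bar{z}_{..}^2\} - n\{V(\bar{z}_s) + \bar{z}_{..}^2\}\Big]
\tag{61}
$$


where $V(\bar{z}_{i(m_i)})_{sy}$ is the variance of the estimate based on a sample 
of one first-stage unit selected with probability $P_i$ and $m_i$ second- 
stage units selected by the method of systematic sampling there- 
from. Substituting from (59) in (61), we therefore get 


$$E(s_{bz}^2) = n\,V(\bar{z}_s)$$


whence $s_{bz}^2/n$ provides an unbiased estimate of $V(\bar{z}_s)$ for the 
scheme of sampling considered 
in this section. 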


Example 9.2 

Reference has been made in Example 9.1 to a pilot survey for 
estimating the total catch of marine fish conducted along a 
100-mile strip of Malabar coast during 1950. The landing centres 
along the coast were first divided into two groups: (a) those 
known for their high fishing activity, and (b) all the rest. From 
the latter stratum, comprising 59 centres, a sample of 10 centres 
was selected with replacement with probability proportional to the 
number of fishing boats as enumerated at the previous census. 
The number of boats for all the 59 centres was 4573. Table 9.4 
gives for the selected centres the number of boats at the previous 
census and the estimated average catch per hour for a certain 
day of the survey based on each of two systematic samples of 
m = 6 and m = 2 hours. Make an estimate of the average 
catch per hour for the stratum and calculate its standard error. 


It is proposed to extend the survey to the entire Indian coast. 
Use the Malabar experience to determine the number of centres 
required for estimating the daily catch with, say, 5% standard 
error. 

Let $A_i$ be the number of boats at the i-th centre and 
$A = \sum_{i=1}^{N} A_i = 4573$ the total number of boats for the stratum. 
Clearly then the selection probability $P_i$ for the i-th centre is 
given by $P_i = A_i/A$. Further, let $\bar{y}_{im}$ be the average catch per 
hour at the i-th centre based on a sample of m hours. Then the 
estimated average catch per hour for the entire stratum and its 
standard error are respectively given by $\bar{z}_{nm}$ and $s_{bz}/\sqrt{n}$, where 

$$\bar{z}_{nm} = \frac{1}{n}\sum_{i=1}^{n}\bar{z}_{im}$$

and 

$$\bar{z}_{im} = \frac{\bar{y}_{im}}{NP_i} = \frac{A}{59\,A_i}\,\bar{y}_{im}$$

The values of $\bar{z}_{im}$ for the selected centres are given in the last 
two columns of Table 9.4, and the various steps in the compu- 
tation of the average catch per hour and its standard error are 
shown below the table. It is seen that for m = 6, the estimated 
average catch per hour works out to 735.8 mds. with a standard 
error of 164.9; and for m = 2, the average catch per hour 
works out to 759.1 mds. with a standard error of 220.8. 
Expressed as percentage of the estimated mean, the standard errors 
are 22% and 29% respectively. 


TABLE 9.4 


The Number of Enumerated Boats and the Estimated Average 
Catch in Mds./Hr. for a Sample of 10 Landing Centres 
along the Malabar Coast 


Serial No.   No. of     Estimated Average          z_im = (A / 59 A_i) y_im 
of the       Boats      Catch (Mds./Hr.) (y_im) 
centre       (A_i)      m=6         m=2            m=6         m=2 
(1)          (2)        (3)         (4)            (5)         (6) 

 1            68        387.50      729.50         441.68      831.51 
 2            36        840.17     1123.00        1808.90     2417.83 
 3            96       1094.17      754.00         883.41      608.76 
 4            45         61.67       55.00         106.22       94.73 
 5           103        899.00     1072.00         676.51      806.69 
 6            74       1401.67     1277.00        1468.13     1337.54 
 7           127        677.33      162.50         413.38       99.17 
 8           174       1331.17      634.00         592.97      282.42 
 9            12         76.50       54.50         494.12      352.02 
10            18        109.83      176.50         472.93      760.01 


                                                     m=6            m=2 
(1)  Total = Σ z_im                               7358.25        7590.68 
(2)  Estimated average catch (Mds./Hr.)            735.83         759.07 
(3)  Correction factor = (Σ z_im)^2 / n        5414384.31     5761842.29 
(4)  Crude S.S. = Σ (z_im)^2                   7862281.63    10147764.33 
(5)  Adjusted S.S. = (4) - (3)                 2447897.32     4385922.04 
(6)  Mean square = s_bz^2                       271988.59      487324.67 
(7)  Variance = s_bz^2 / n                       27198.86       48732.47 
(8)  Standard error (Mds.)                         164.9          220.8 
(9)  Standard error as % of estimated average       22             29 
(10) No. of centres for 5% S.E.                    201            339 


The number of centres required for estimating the daily catch 
along the entire coast with 5% standard error is given by 


$$n = \frac{s_{bz}^2}{0.0025\,\bar{z}_{nm}^2}$$

= 201 for m = 6 
= 339 for m = 2 

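
The steps listed below Table 9.4 can be retraced from the last two columns of the table. The sketch below is only an illustration; the function name `summarize` is hypothetical, and the number of centres is rounded up to the next whole number on the assumption that this is how the tabulated values were obtained.

```python
import math

# z_im values from columns (5) and (6) of Table 9.4
z_m6 = [441.68, 1808.90, 883.41, 106.22, 676.51,
        1468.13, 413.38, 592.97, 494.12, 472.93]
z_m2 = [831.51, 2417.83, 608.76, 94.73, 806.69,
        1337.54, 99.17, 282.42, 352.02, 760.01]

def summarize(z, relative_se=0.05):
    """Mean, standard error and number of centres for a given relative S.E."""
    n = len(z)
    mean = sum(z) / n
    s2 = sum((v - mean) ** 2 for v in z) / (n - 1)   # mean square s_bz^2
    se = math.sqrt(s2 / n)
    centres = math.ceil(s2 / (relative_se * mean) ** 2)
    return mean, se, centres

mean6, se6, centres6 = summarize(z_m6)
mean2, se2, centres2 = summarize(z_m2)
```

The results agree with rows (2), (8) and (10) of the computations shown with the table.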

CHAPTER X 


NON-SAMPLING ERRORS 
A. OBSERVATIONAL ERRORS 


10a.1 Introduction 


In developing the sampling theory in the preceding chapters, 
we assumed that the character observed on the i-th unit of the 
population (i = 1, 2, ..., N), takes a unique value $y_i$ whenever 
the unit is included in the sample, irrespective of the person 
who enumerates it. By implication we assumed that a complete 
count of all the N units gives a unique value for the mean or 
the total of the population. In practice, however, the situation 
is rarely so simple as the one described above, since the value 
observed on any unit will also depend upon the enumerator 
reporting the value. Thus, an eye-estimate of the yield of a crop 
in a field will depend upon the judgment of the enumerator making 
the estimate and will invariably be different from the true value 
of the yield obtained by harvesting the crop in the field. The 
magnitude and direction of the difference will depend upon the 


the selected unit or interviewee a 
value. Even with factual charact 
e.g., the area under crops, the nu 
etc., there is found to be a marke 
of the same or different enumerators. 
the sampling fraction is unity, or. 
count of all the N units is made, t 
counts. As the errors responsible 
process of collecting data, 


but are also referred to as response errors (Hansen et al, 1951). 


5 from incomplete samples and 


faulty procedures of estimation, they go to make up what are 


termed as non-sampling errors. 


We have given several examples in Chapter I to show that 
the net effect of non-sampling errors on the value of the estimate 
can sometimes be serious, and that it is, therefore, important to 
control them as far as practicable. In Part A of this chapter 
we shall deal with the measurement and control of observational 
errors and in Part B with the treatment of incomplete samples. 


10a.2 Mathematical Model for the Measurement of Observational 
Errors 


Let $x_i$ (i = 1, 2, ..., h) denote the true value of the character 
on the i-th unit in a simple random sample of h units drawn from 
N units, and $y_{ijk}$ 
(j = 1, 2, ..., m; k = 0, 1, 2, ..., $n_{ij}$) 
denote the value reported by the j-th enumerator on the i-th unit 
for the k-th occasion. It will be seen that m enumerators have 
been assumed to participate in the survey, with the j-th enumerator 
making $n_{ij}$ observations on the i-th unit in the sample. 


The difference between the reported value and the true value 
is called the error of observation, and for any given measurement 
technique will depend upon the enumerator reporting the value, 
the interaction of the enumerator with the true value of the unit, 
and the mood and like causes at the time of reporting. The 
reported value may, therefore, be considered as being made up 
of four uncorrelated components as follows: 


$$y_{ijk} = x_i + \alpha_j + \delta_{ij} + \epsilon_{ijk} \tag{1}$$


where 
$\alpha_j$ represents the bias of the j-th enumerator in repeated 
observations on all units, 
$\delta_{ij}$ the interaction of the j-th enumerator with 
the i-th unit, 
$\epsilon_{ijk}$ the deviation from $x_i + \alpha_j + \delta_{ij}$ when the 
j-th enumerator reports on the i-th unit on 
the k-th occasion. 

Equations (4) and (5) are now simplified, being given by 


$$\bar{y}_{.j} = \frac{1}{\bar{n}}\sum_{i=1}^{h}x_i n_{ij} + \alpha_j + \frac{1}{\bar{n}}\sum_{i=1}^{h}\epsilon_{ij} n_{ij} \tag{6}$$

and 

$$\bar{y}_{..} = \frac{1}{h}\sum_{i=1}^{h}x_i + \frac{1}{m}\sum_{j=1}^{m}\alpha_j + \frac{1}{hp}\sum_{j=1}^{m}\sum_{i=1}^{h}\epsilon_{ij} n_{ij} \tag{7}$$

It follows that 

$$E(\bar{y}_{.j}) = \frac{1}{N}\sum_{i=1}^{N}x_i + \frac{1}{M}\sum_{j=1}^{M}\alpha_j = \mu + \bar{\alpha} \tag{8}$$


where $\mu$ is the population mean of the true values, to be estimated, 
and $\bar{\alpha}$ is the population mean of enumerators' biases. Also, 


$$E(\bar{y}_{..}) = \mu + \bar{\alpha} \tag{9}$$


It will be seen that the sample mean $\bar{y}_{..}$ does not provide an 
unbiased estimate of $\mu$, unless the $\alpha_j$'s vary in such a way that $\bar{\alpha}$ is 
zero. Experience indicates that although $\alpha_j$ is usually variable 
from enumerator to enumerator, $\bar{\alpha}$ is not always negligible. Thus 
in estimating the crop by eye, there is a tendency to under-estimate 
the crop in good years and over-estimate it in bad years. Conse- 
quently, in a good year the bulk of the enumerators under- 
estimate the crop, resulting in a significant negative value for $\bar{\alpha}$, 
and in a bad year the bulk of them over-estimate it giving $\bar{\alpha}$ 
a significant positive value. Again, as pointed out in Chapter I, 
when a crop is unevenly sown as in India, sample-harvesting it by 
small plots, such as those marked with a portable frame, may 
result in over-estimation of yield (Sukhatme, 1947). It is, there- 
fore, of the highest importance in surveys to ensure that the 
bias $\bar{\alpha}$ is negligible. 

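(c) Variance of the Sample Mean 

By definition, 

$$V(\bar{y}_{.j}) = E\{\bar{y}_{.j} - E(\bar{y}_{.j})\}^2$$

Substituting for $\bar{y}_{.j}$ and $E(\bar{y}_{.j})$ from (6) and (8), we have 


$$V(\bar{y}_{.j}) = E\Big[\frac{1}{\bar{n}}\sum_{i=1}^{h}x_i n_{ij} - \mu + \alpha_j - \bar{\alpha} + \frac{1}{\bar{n}}\sum_{i=1}^{h}\epsilon_{ij} n_{ij}\Big]^2 \tag{10}$$


On squaring the expression within brackets in (10), taking 
expectations term by term, and noting that the expectations of 
the product terms are zero, we have 


$$V(\bar{y}_{.j}) = E\Big(\frac{1}{\bar{n}}\sum_{i=1}^{h}x_i n_{ij} - \mu\Big)^2 + E(\alpha_j - \bar{\alpha})^2 + \frac{1}{\bar{n}^2}E\Big(\sum_{i=1}^{h}\epsilon_{ij} n_{ij}\Big)^2 \tag{11}$$


To evaluate the first term in (11), we write 


$$
\begin{aligned}
E\Big(\frac{1}{\bar{n}}\sum_{i=1}^{h}x_i n_{ij} - \mu\Big)^2
&= \frac{1}{\bar{n}^2}E\Big(\sum_{i=1}^{h}x_i^2 n_{ij}^2 + \sum_{i \neq i'}x_i x_{i'} n_{ij} n_{i'j}\Big) - \mu^2 \\
&= \frac{1}{\bar{n}^2}\Big\{\sum_{i=1}^{h}E(x_i^2)\,n_{ij}^2 + \sum_{i \neq i'}E(x_i x_{i'})\,n_{ij} n_{i'j}\Big\} - \mu^2
\end{aligned}
$$


Substituting for $E(x_i^2)$ and $E(x_i x_{i'})$ from (21) and (34) of 
Chapter II, we have 


$$
\begin{aligned}
E\Big(\frac{1}{\bar{n}}\sum_{i=1}^{h}x_i n_{ij} - \mu\Big)^2
&= \frac{1}{\bar{n}^2}\Big[\Big\{\frac{N-1}{N}S_x^2 + \mu^2\Big\}\sum_{i=1}^{h}n_{ij}^2
 + \Big(\mu^2 - \frac{S_x^2}{N}\Big)\sum_{i \neq i'}n_{ij} n_{i'j}\Big] - \mu^2 \\
&= S_x^2\left(\frac{1}{\bar{n}} - \frac{1}{N}\right)
\end{aligned}
\tag{12}
$$

where 

$$S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2 \tag{13}$$


The value of the second term in (11) is given by 


$$E(\alpha_j - \bar{\alpha})^2 = S_\alpha^2\left(1 - \frac{1}{M}\right) \tag{14}$$

where 

$$S_\alpha^2 = \frac{1}{M-1}\sum_{j=1}^{M}(\alpha_j - \bar{\alpha})^2 \tag{15}$$

while that of the third term in (11), by 


$$
\begin{aligned}
\frac{1}{\bar{n}^2}E\Big(\sum_{i=1}^{h}\epsilon_{ij} n_{ij}\Big)^2
&= \frac{1}{\bar{n}^2}E\Big[\sum_{i=1}^{h}\epsilon_{ij}^2 n_{ij}^2 + \sum_{i \neq i'}\epsilon_{ij}\epsilon_{i'j} n_{ij} n_{i'j}\Big] \\
&= \frac{1}{\bar{n}^2}\Big[\sum_{i=1}^{h}E(\epsilon_{ij}^2 \mid j)\,n_{ij}^2 + \sum_{i \neq i'}E(\epsilon_{ij}\epsilon_{i'j} \mid j)\,n_{ij} n_{i'j}\Big] \\
&= \frac{1}{\bar{n}^2}\Big(S_\epsilon^2\sum_{i=1}^{h}n_{ij}^2\Big) = \frac{S_\epsilon^2}{\bar{n}}
\end{aligned}
\tag{16}
$$

since 

$$E(\epsilon_{ij}^2 \mid j) = S_{\epsilon j}^2 = S_\epsilon^2$$

and 

$$E(\epsilon_{ij}\,\epsilon_{i'j}) = 0$$

Hence, substituting from (12), (14) and (16) in (11), we get 

$$V(\bar{y}_{.j}) = S_x^2\left(\frac{1}{\bar{n}} - \frac{1}{N}\right) + S_\alpha^2\left(1 - \frac{1}{M}\right) + \frac{S_\epsilon^2}{\bar{n}} \tag{17}$$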


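For N and M infinitely large, we have 


$$V(\bar{y}_{.j}) = \frac{1}{\bar{n}}(S_x^2 + S_\epsilon^2) + S_\alpha^2 \tag{18}$$

To obtain the variance of $\bar{y}_{..}$, we proceed in a similar way. 
We write 

$$
\begin{aligned}
V(\bar{y}_{..}) &= E\{\bar{y}_{..} - E(\bar{y}_{..})\}^2 \\
&= E\Big(\frac{1}{h}\sum_{i=1}^{h}x_i - \mu\Big)^2 + E\Big(\frac{1}{m}\sum_{j=1}^{m}\alpha_j - \bar{\alpha}\Big)^2
 + \frac{1}{h^2p^2}E\Big(\sum_{j=1}^{m}\sum_{i=1}^{h}\epsilon_{ij} n_{ij}\Big)^2
\end{aligned}
\tag{19}
$$

Now, since $x_1, x_2, ..., x_h$ is a simple random sample of the 
population, 

$$E\Big(\frac{1}{h}\sum_{i=1}^{h}x_i - \mu\Big)^2 = S_x^2\left(\frac{1}{h} - \frac{1}{N}\right) \tag{20}$$

Similarly, 

$$E\Big(\frac{1}{m}\sum_{j=1}^{m}\alpha_j - \bar{\alpha}\Big)^2 = S_\alpha^2\left(\frac{1}{m} - \frac{1}{M}\right) \tag{21}$$


It remains to evaluate the third term only. We have 


$$
\begin{aligned}
\frac{1}{h^2p^2}E\Big(\sum_{j=1}^{m}\sum_{i=1}^{h}\epsilon_{ij} n_{ij}\Big)^2
&= \frac{1}{h^2p^2}E\Big[\sum_{j=1}^{m}E\Big\{\Big(\sum_{i=1}^{h}\epsilon_{ij} n_{ij}\Big)^2 \Bigm| j\Big\}\Big] \\
&\quad + \frac{1}{h^2p^2}E\Big[\sum_{j \neq j'}E\Big(\sum_{i=1}^{h}\epsilon_{ij} n_{ij} \Bigm| j\Big)
 \times E\Big(\sum_{i=1}^{h}\epsilon_{ij'} n_{ij'} \Bigm| j'\Big)\Big] \\
&= \frac{S_\epsilon^2}{hp}
\end{aligned}
\tag{22}
$$

since 

$$E\Big(\sum_{i=1}^{h}\epsilon_{ij} n_{ij} \Bigm| j\Big) = 0 = E\Big(\sum_{i=1}^{h}\epsilon_{ij'} n_{ij'} \Bigm| j'\Big)$$


Hence, substituting from (20), (21) and (22) in (19), we obtain 


$$V(\bar{y}_{..}) = S_x^2\left(\frac{1}{h} - \frac{1}{N}\right) + S_\alpha^2\left(\frac{1}{m} - \frac{1}{M}\right) + \frac{S_\epsilon^2}{hp} \tag{23}$$


For N and M infinitely large, we have 


$$V(\bar{y}_{..}) = \frac{S_x^2}{h} + \frac{S_\alpha^2}{m} + \frac{S_\epsilon^2}{hp} \tag{24}$$


Further, if p = 1, which will usually be the case in practice, 
the expression for $V(\bar{y}_{..})$ will be simplified, being given by 


$$V(\bar{y}_{..}) = \frac{S_x^2 + S_\epsilon^2}{h} + \frac{S_\alpha^2}{m} \tag{25}$$


which can be alternatively expressed as 


$$V(\bar{y}_{..}) = \frac{S_y^2}{h} + S_\alpha^2\left(\frac{1}{m} - \frac{1}{h}\right) \tag{26}$$


since the variance of a single observation drawn from an infinite 
population when M is infinitely large is clearly $S_x^2 + S_\alpha^2 + S_\epsilon^2 = S_y^2$. 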



The above formulae must be considered as fundamental formulae 
in the theory of sample surveys. They show that the sampling 
variance of the estimate is not entirely due to errors arising from 
chance variation in selection of the sample of h units, but is 
inflated by the variability in biases of the enumerators. Conse- 
quently, the formulae given in the previous chapters under-estimate 
the actual sampling variance to which the estimates are subject. 
This emphasises another point that it is not sufficient in surveys 
to ensure that the $\alpha_j$'s cancel each other on the average, but that 
it is necessary also to see that the effect of variability in the 
$\alpha_j$'s is reduced to the minimum, by taking the maximum care in 
planning details of the surveys such as the questionnaire, the 
method of making observations, the training of field staff and the 
supervision over their work. 

It follows too that the variance of the sample estimate does 
not reduce to zero even when a complete count is made. For, 
the limit of equation (26) when h = N and N becomes infinitely 
large is given by 

$$V(\bar{y}_{..} \mid h = \infty) = \frac{S_{\alpha'}^2}{m'}$$


where $m' = N/\bar{n}$ is the number of enumerators required for the 
complete count and selected out of a population of M' enumerators 
assumed infinitely large, and where $S_{\alpha'}^2$ represents the variability 
of the biases of the population of enumerators under census 
conditions. This expression provides us with an important con- 
sideration in determining the relative roles of complete enumera- 
tion and random sampling. In particular, we notice that a sample 
may give a more reliable estimate than a census when 


$$\frac{S_x^2 + S_\epsilon^2}{h} + \frac{S_\alpha^2}{m} < \frac{S_{\alpha'}^2}{m'}$$

It is in fact in this possibility of controlling the magnitude of the 
variability among the $\alpha_j$'s in a relatively small survey compared to 
a census by adopting more refined methods of enumeration, not 
always possible in a census, and by recruiting better-trained and 
better-paid enumerators, that lies the effective role of sampling 
for collecting information. 
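
The comparison between a sample and a census can be put in numerical form. The sketch below evaluates (25) for the sample and the limiting form of (26) for the census; the function names and all the numerical values are hypothetical and chosen purely for illustration, on the assumption that the enumerators in the small survey have less variable biases than those employed in the census.

```python
def sample_variance(s2_x, s2_eps, s2_alpha, h, m):
    """Variance of the sample mean from (25): p = 1, N and M large."""
    return (s2_x + s2_eps) / h + s2_alpha / m

def census_variance(s2_alpha_census, m_census):
    """Variance of a complete count: only enumerator bias variability remains."""
    return s2_alpha_census / m_census

# hypothetical components of variance (not taken from any survey)
v_sample = sample_variance(s2_x=100.0, s2_eps=20.0, s2_alpha=4.0, h=400, m=20)
v_census = census_variance(s2_alpha_census=25.0, m_census=40)
sample_is_better = v_sample < v_census
```

With these assumed values the sample of 400 units observed by 20 well-trained enumerators has a smaller variance than the complete count made by 40 enumerators whose biases are more variable.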
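
10a.4 Estimation of the Different Components of Variance 


Let $s_b^2$ denote the mean square between the means of m 
enumerators, defined by 


$$s_b^2 = \frac{1}{m-1}\sum_{j=1}^{m}(\bar{y}_{.j} - \bar{y}_{..})^2 \tag{27}$$


Then expanding and taking expectations, we write 


$$
\begin{aligned}
(m-1)\,E(s_b^2) &= \sum_{j=1}^{m}E(\bar{y}_{.j}^2) - mE(\bar{y}_{..}^2) \\
&= \sum_{j=1}^{m}\big[V(\bar{y}_{.j}) + \{E(\bar{y}_{.j})\}^2\big] - m\big[V(\bar{y}_{..}) + \{E(\bar{y}_{..})\}^2\big] \\
&= \sum_{j=1}^{m}\{V(\bar{y}_{.j}) + (\mu + \bar{\alpha})^2\} - m\{V(\bar{y}_{..}) + (\mu + \bar{\alpha})^2\} \\
&= \sum_{j=1}^{m}V(\bar{y}_{.j}) - mV(\bar{y}_{..})
\end{aligned}
\tag{28}
$$


Substituting from (17) and (23) in (28), we get 


$$E(s_b^2) = S_\alpha^2 + \frac{m}{hp}\,S_\epsilon^2 + \frac{m}{hp}\cdot\frac{m-p}{m-1}\,S_x^2 \tag{29}$$


Similarly, denoting by $s_w^2$ the mean square between observations 
within enumerators, defined by 


$$s_w^2 = \frac{1}{hp - m}\sum_{j}\sum_{i}(y_{ij} - \bar{y}_{.j})^2 \tag{30}$$


we get 


$$(hp - m)\,E(s_w^2) = hp\,\{V(y_{ij}) - V(\bar{y}_{.j})\} \tag{31}$$

Now putting $\bar{n} = 1$ in (17), we have 


$$V(y_{ij}) = S_x^2\left(1 - \frac{1}{N}\right) + S_\alpha^2\left(1 - \frac{1}{M}\right) + S_\epsilon^2 \tag{32}$$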


сатын 
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Hence, substituting from (17) and (32) in (31), we get 
Е (5,2) =S2 + 52 (33) 


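Equations (29) and (33) must be supplemented by additional 
information in order to be able to estimate $S_x^2$, $S_\alpha^2$ and $S_\epsilon^2$ 
separately. We shall, therefore, consider yet another mean square, 
namely, that between unit means, $s_y^2$, given by 

$$ s_y^2 = \frac{1}{h-1} \sum_{i=1}^{h} (\bar{y}_{i.} - \bar{y}_{..})^2 \tag{34} $$

Expanding and taking expectations, we write 

$$
\begin{aligned}
(h-1)\, E(s_y^2) &= \sum_{i=1}^{h} E(\bar{y}_{i.}^2) - h\, E(\bar{y}_{..}^2) \\
&= \sum_{i=1}^{h} V(\bar{y}_{i.}) - h\, V(\bar{y}_{..})
\end{aligned}
\tag{35}
$$

The number of enumerators making observations on any unit is 
$p$. Hence, from (23), we get 

$$ V(\bar{y}_{i.}) = S_x^2 \left( 1 - \frac{1}{N} \right) + S_\alpha^2 \left( \frac{1}{p} - \frac{1}{M} \right) + \frac{1}{p}\, S_\epsilon^2 \tag{36} $$

Substituting from (36) and (23) in (35), we obtain 

$$ E(s_y^2) = S_x^2 + \frac{1}{p}\, S_\epsilon^2 + \frac{h}{h-1} \cdot \frac{m-p}{mp}\, S_\alpha^2 \tag{37} $$

The set of three equations (29), (33) and (37) provides the 
estimates of $S_x^2$, $S_\alpha^2$ and $S_\epsilon^2$. In particular, we obtain 

$$ \text{Est. } S_\alpha^2 = \frac{p\,(m-1)(h-1)}{pmh - ph - pm + m} \left\{ s_b^2 + \frac{m}{h(m-1)}\, s_y^2 - \frac{m^2}{ph(m-1)}\, s_w^2 \right\} \tag{38} $$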

In practice, however, as already noted, $p$ will be 1. The 
equations (29), (33) and (37) then simplify as follows: 

$$ E(s_b^2) = \frac{m}{h} (S_x^2 + S_\epsilon^2) + S_\alpha^2 \tag{39} $$

$$ E(s_w^2) = S_x^2 + S_\epsilon^2 \tag{40} $$

and 

$$ E(s_y^2) = S_x^2 + S_\epsilon^2 + \frac{h}{h-1} \cdot \frac{m-1}{m}\, S_\alpha^2 \tag{41} $$

The three equations are no longer linearly independent. For, 
when $p = 1$, the three mean squares are connected by the 
identity: 

$$ (h-1)\, s_y^2 = (m-1)\, \frac{h}{m}\, s_b^2 + (h-m)\, s_w^2 \tag{42} $$

We further notice that $S_x^2 + S_\epsilon^2$ occurs together in all the 
equations. Solving then for $S_\alpha^2$ and $S_x^2 + S_\epsilon^2$ together any two 
of the three equations, we obtain 

$$ \text{Est. } S_\alpha^2 = s_b^2 - \frac{m}{h}\, s_w^2 \tag{43} $$

and 

$$ \text{Est. } (S_x^2 + S_\epsilon^2) = s_w^2 \tag{44} $$

The expression for the estimate of $S_\alpha^2$ given in (43) can also be 
derived directly from (38) by putting $p = 1$ and substituting for 
$s_y^2$ from (42). It follows from (25) that for $N$ and $M$ infinitely 
large and $p = 1$, 

$$ \text{Est. } V(\bar{y}_{..}) = \frac{s_b^2}{m} \tag{45} $$

which, using (42), can also be put in the alternative form 

$$ \text{Est. } V(\bar{y}_{..}) = \frac{s_y^2}{h} + \frac{h-m}{m-1} \cdot \frac{1}{h}\, (s_y^2 - s_w^2) \tag{46} $$

Equation (46) shows that $s_y^2/h$ no longer gives an unbiased 
estimate of the variance of the estimated mean but that the 
variance is inflated by a component 

$$ \frac{h-m}{m-1} \cdot \frac{1}{h}\, (s_y^2 - s_w^2) $$

the latter vanishing when the differential biases are absent. 
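
The estimates (43) to (46) for the case $p = 1$ can be checked numerically. The following Python sketch is an illustration only: the function name `variance_components` and the data used with it are hypothetical, and it assumes that every enumerator reports on the same number of units and that $N$ and $M$ are large.

```python
from statistics import mean


def variance_components(groups):
    """Mean squares and component estimates for p = 1.

    groups: one list of reported values per enumerator, all of equal length.
    """
    m = len(groups)                      # number of enumerators
    n_bar = len(groups[0])               # units per enumerator
    h = m * n_bar                        # total number of units
    grand_mean = mean(y for g in groups for y in g)
    enum_means = [mean(g) for g in groups]

    # Mean square between enumerator means, as in (27)
    s_b2 = sum((gm - grand_mean) ** 2 for gm in enum_means) / (m - 1)
    # Mean square between observations within enumerators, as in (30)
    s_w2 = sum((y - gm) ** 2
               for g, gm in zip(groups, enum_means) for y in g) / (m * (n_bar - 1))
    # Mean square between units, as in (34) with p = 1
    s_y2 = sum((y - grand_mean) ** 2 for g in groups for y in g) / (h - 1)

    return {
        "s_b2": s_b2,
        "s_w2": s_w2,
        "s_y2": s_y2,
        "est_S_alpha2": s_b2 - (m / h) * s_w2,            # (43)
        "est_Sx2_plus_Seps2": s_w2,                       # (44)
        "est_var_mean": s_b2 / m,                         # (45)
        "est_var_mean_alt": s_y2 / h
        + (h - m) / ((m - 1) * h) * (s_y2 - s_w2),        # (46)
    }


# Hypothetical data: two enumerators, three units each.
result = variance_components([[10, 12, 14], [20, 22, 24]])
```

For these made-up figures the two forms (45) and (46) of the variance estimate agree, which is simply the identity (42) at work.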


Example 10.1 


This example is taken from the crop survey for estimation of 
the average yield of wheat conducted in Sind (Pakistan) in 
1945-46. The design of the survey was stratified multi-stage 
sampling with subdivisions as the strata, a village as the first- 
stage unit, a field as the second-stage unit of sampling and a plot 
of 1/40 acre as the ultimate unit of sampling. Within each 
stratum the work was divided into two independent samples, one 
to be carried out by an official of the Department of Revenue, 
and the other by an official of the Department of Agriculture. 
Table 10.1 shows, for one stratum, namely subdivision Kambar, 
the estimates of the average yield for the two samples, together 
with the pooled analysis of variance of the whole sample. 


TABLE 10.1 
Yield Survey on Wheat: Subdivision Kambar, Sind (Pakistan) 
Estimates of Average Yield and Analysis of Variance 

                            Revenue   Agriculture   Combined 
Mean yield in oz./plot        100.3          54.9       85.6 
Number of experiments            25            12         37 

Analysis of Variance (Oz./Plot)² 

Source                                  D.F.   Mean Square 
Between enumerators                        1   16714.6 (= E) 
Between villages within enumerators       15    7377.8 (= B) 
Within villages                           20     315.4 (= W) 

10a.5 The Mean and Variance of a Stratified Sample in which 
Enumerators are Assigned the Units in their Respective Strata 


The method of assigning enumerators considered in the previous 
sections, in which the units in the sample are randomly distributed 
among the different enumerators, is not common in practice, 
owing to the large travel costs it involves. The more common 
method is to assign neighbouring units in the sample falling in 
specified geographical areas, to two or more enumerators as may 
be needed, in the form of replicated sub-samples. As an example 
we may mention the assignment practice followed in crop surveys 
in India and described in Example 10.1. The design of the crop 
survey is a stratified sample with geographical divisions forming 
the strata and the villages selected from any stratum are randomly 
distributed among the requisite number of enumerators drawn 
locally. We shall first consider the case of a stratified sample 
of $h$ units drawn from the $k$ geographic strata of the population, 
with $h_t$ units drawn from the $t$-th stratum, so that $\sum_{t=1}^{k} h_t = h$. 

We shall suppose that within any stratum an enumerator can 
enumerate a sample of $\bar{n}$ units during the period of the survey, 
so that $m_t$ enumerators will be needed to enumerate $h_t$ units in 
the $t$-th stratum, where $\bar{n}\, m_t = h_t$. We shall suppose that the $h_t$ 
units in the $t$-th stratum are randomly distributed among the $m_t$ 
enumerators and further suppose, for the sake of simplicity in the 
discussion, that the number of units in the population in the 
$t$-th stratum, $N_t$, and also the population of potential enumerators, 
$M_t$, are infinitely large. It will be seen that we have assumed that 
$p = 1$; consequently, we cannot separately evaluate $S_x^2$ and $S_\epsilon^2$. 

We shall use the same notation as in the previous sections 
except for the introduction of the letter $t$ to indicate the $t$-th 
stratum. We shall denote by 

$x^{t}_{i}$ $(i = 1, 2, \ldots, h_t)$ the true value of the $i$-th unit in the 
sample of $h_t$ units drawn from the $t$-th stratum, 

and 

$y^{t}_{ij}$ the value reported on $x^{t}_{i}$ by the $j$-th enumerator out of 
$m_t$ in the stratum. 

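Clearly, the sample mean for the $t$-th stratum will be given by 

$$ \bar{y}^{t}_{..} = \frac{1}{h_t} \sum_{i}^{h_t} \sum_{j}^{m_t} y^{t}_{ij} \tag{47} $$

while the sample mean for the population will be given by 

$$ \bar{y}_{..} = \sum_{t=1}^{k} p_t\, \bar{y}^{t}_{..} $$

where 

$$ p_t = \frac{N_t}{N} \tag{48} $$

It follows from the results of the previous sections that 

$$ E(\bar{y}^{t}_{..} \mid h_t) = \mu_t + \bar{\alpha}_t \tag{49} $$

where $\mu_t$ = the population mean of true values in the $t$-th 
stratum, and $\bar{\alpha}_t$ = the population mean of the biases of 
enumerators in the $t$-th stratum. Also 

$$
\begin{aligned}
E(\bar{y}_{..} \mid h_1, h_2, \ldots, h_k) &= \sum_{t=1}^{k} p_t\, E(\bar{y}^{t}_{..} \mid h_t) \\
&= \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t) \\
&= \mu + \bar{\alpha}
\end{aligned}
\tag{50}
$$

Similarly 

$$ V(\bar{y}^{t}_{..} \mid h_t) = \frac{S_{tx}^2}{h_t} + \frac{S_\epsilon^2}{h_t} + \frac{S_{t\alpha}^2}{m_t} \tag{51} $$

where $S_{tx}^2$ is the variance of $x^{t}_{i}$, $S_{t\alpha}^2$ the variance of $\alpha_j$ in the 
$t$-th stratum, and $S_\epsilon^2$, for the sake of simplicity, is assumed to be 
constant from stratum to stratum; and 

$$
\begin{aligned}
V(\bar{y}_{..} \mid h_1, h_2, \ldots, h_k) &= \sum_{t=1}^{k} p_t^2\, V(\bar{y}^{t}_{..} \mid h_t) \\
&= \sum_{t=1}^{k} p_t^2 \left( \frac{S_{tx}^2}{h_t} + \frac{S_\epsilon^2}{h_t} + \frac{S_{t\alpha}^2}{m_t} \right)
\end{aligned}
\tag{52}
$$
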
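The estimates of $S_{tx}^2 + S_\epsilon^2$ and of $S_{t\alpha}^2$ are provided by the 
same formulae as before. We have 

$$ \text{Est. } (S_{tx}^2 + S_\epsilon^2) = s_{tw}^2 \tag{53} $$

and 

$$ \text{Est. } S_{t\alpha}^2 = s_{tb}^2 - \frac{m_t}{h_t}\, s_{tw}^2 \tag{54} $$

whence 

$$ \text{Est. } V(\bar{y}^{t}_{..} \mid h_t) = \frac{s_{tb}^2}{m_t} \tag{55} $$

and 

$$ \text{Est. } V(\bar{y}_{..} \mid h_1, h_2, \ldots, h_k) = \sum_{t=1}^{k} p_t^2\, \frac{s_{tb}^2}{m_t} \tag{56} $$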
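
As a rough illustration of how the stratum-wise estimates combine, the sketch below evaluates the differential-bias estimate for one stratum and the estimated variance of the stratified mean. The function names and the figures are hypothetical; the formulas assumed are Est. $S_{t\alpha}^2 = s_{tb}^2 - (m_t/h_t)\, s_{tw}^2$ and Est. $V(\bar{y}_{..}) = \sum_t p_t^2\, s_{tb}^2 / m_t$, with $p = 1$ and large stratum populations.

```python
def stratum_bias_variance(s_tb2, s_tw2, m_t, h_t):
    """Estimate of the differential-bias component in one stratum."""
    return s_tb2 - (m_t / h_t) * s_tw2


def stratified_variance_estimate(strata):
    """Estimated variance of the stratified sample mean.

    strata: iterable of (p_t, s_tb2, m_t) tuples, one per stratum, where
    p_t is the stratum weight, s_tb2 the mean square between enumerator
    means and m_t the number of enumerators in the stratum.
    """
    return sum(p_t ** 2 * s_tb2 / m_t for p_t, s_tb2, m_t in strata)


# Hypothetical two-stratum survey with equal weights.
example = [(0.5, 40.0, 2), (0.5, 80.0, 4)]
variance = stratified_variance_estimate(example)
```

Each stratum contributes $p_t^2\, s_{tb}^2/m_t$, so a stratum with few enumerators and widely differing enumerator means dominates the estimate.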


Example 10.2 


The data for this example are derived from a pilot survey for 
comparing the relative efficiency of plots of different size in 
estimating the average yield of irrigated wheat in Moradabad 
District of U.P. (India) in 1944-45. The design of the survey 
and the method of assigning enumerators were similar to those 
described in Example 10.1. Thus, in each of the five subdivisions 
of the district, two independent samples of two villages each were 
selected and allocated to two enumerators designated A and B. 
In each village two fields were selected and in each field two 
plots of each size were marked. The data relating to the plot size 
of an equilateral triangle of side 25 links (117-9 sq.ft.) are taken here 
for illustration. Table 10.2 gives estimates of the average yield 
together with the analysis of variance for individual subdivisions, 
and also the pooled values for the district. Estimate the 
contribution to the total variation due to $S_\alpha^2$. 

The calculations are straightforward. Substituting from the 
table the values of E and B in the formula 

$$ \text{Est. } S_{t\alpha}^2 = s_{tb}^2 - \frac{m_t}{h_t}\, s_{tw}^2 = \frac{E_t - B_t}{8} $$

we obtain the values shown in the last row of the table. The 
average magnitude of $S_\alpha^2$ over the five subdivisions works out 
to 6569.3. 

Also 

$$ \text{Est. } (S_{tx}^2 + S_\epsilon^2) = s_{tw}^2 $$

whence the average value of $S_x^2 + S_\epsilon^2$ for the district works out 
to 13074.8. The total variance of an observation is given by 

$$ S_y^2 = S_x^2 + S_\alpha^2 + S_\epsilon^2 = 19644.1 $$

Thus $S_\alpha^2$ accounts for about 33% of the total variation. 
It should be pointed out that both in this and the previous 
examples the estimates of the component of variance due to 


differential bias are based on small numbers and, therefore, not 


sufficiently reliable. They are presented here only for purposes 
of illustration. 
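
The arithmetic of Example 10.2 can be retraced in a few lines. The two district averages are the figures quoted in the text; the variable names are introduced here only for illustration.

```python
# District averages quoted in Example 10.2
est_s_alpha2 = 6569.3          # average estimate of the differential-bias component
est_sx2_plus_seps2 = 13074.8   # average estimate of the remaining components

# Total variance of a single observation and the share due to differential bias
total_variance = est_s_alpha2 + est_sx2_plus_seps2
share_of_bias = est_s_alpha2 / total_variance

print(round(total_variance, 1), round(100 * share_of_bias))
```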


10a.6 The Mean and Variance of an Unstratified Sample in which 
Enumerators are Assigned Neighbouring Units 


Sometimes $p_t$ is not known and yet we wish to assign to the 
enumerators neighbouring units in the sample falling in specified 
geographical areas, in the manner indicated above, after an 
unstratified sample of $h$ has been selected from the population. 
In this situation the number of units falling in the $t$-th stratum, 
namely $h_t$, is a random variable, and so is $m_t$, while the estimate 
of $p_t$ is provided by $h_t/h$. 

The sample mean $\bar{y}_{..}$ for this case is given by 

$$ \bar{y}_{..} = \frac{1}{h} \sum_{t=1}^{k} h_t\, \bar{y}^{t}_{..} \tag{57} $$

and its expected value and variance for a given set of $h_1, h_2, \ldots, h_k$ by 

$$ E(\bar{y}_{..} \mid h_1, h_2, \ldots, h_k) = \frac{1}{h} \sum_{t=1}^{k} h_t (\mu_t + \bar{\alpha}_t) \tag{58} $$
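and 

$$
\begin{aligned}
V(\bar{y}_{..} \mid h_1, h_2, \ldots, h_k) &= \frac{1}{h^2} \sum_{t=1}^{k} h_t^2 \left( \frac{S_{tx}^2}{h_t} + \frac{S_\epsilon^2}{h_t} + \frac{S_{t\alpha}^2}{m_t} \right) \\
&= \frac{1}{h^2} \sum_{t=1}^{k} h_t \left( S_{tx}^2 + S_\epsilon^2 + \bar{n}\, S_{t\alpha}^2 \right)
\end{aligned}
\tag{59}
$$

It is seen from (58) that the conditional expected value of $\bar{y}_{..}$ does 
not equal $\mu + \bar{\alpha}$. To the expression (59) we must, therefore, 
add the square of the bias component and then take the 
expectation in order to obtain the variance of $\bar{y}_{..}$. We write 

$$
\begin{aligned}
V(\bar{y}_{..}) &= E\, V(\bar{y}_{..} \mid h_1, h_2, \ldots) + E \{ E(\bar{y}_{..} \mid h_1, h_2, \ldots) - \mu - \bar{\alpha} \}^2 \\
&= E \left[ \frac{1}{h^2} \sum_{t=1}^{k} h_t ( S_{tx}^2 + S_\epsilon^2 + \bar{n}\, S_{t\alpha}^2 ) \right] + E \left[ \sum_{t=1}^{k} \left( \frac{h_t}{h} - p_t \right) (\mu_t + \bar{\alpha}_t) \right]^2
\end{aligned}
\tag{60}
$$

The value of the first part in (60) is clearly given by 

$$ \frac{1}{h} \sum_{t=1}^{k} p_t ( S_{tx}^2 + S_\epsilon^2 ) + \frac{\bar{n}}{h} \sum_{t=1}^{k} p_t\, S_{t\alpha}^2 \tag{61} $$

while for the second, we write 

$$
\begin{aligned}
E \left[ \sum_{t=1}^{k} \left( \frac{h_t}{h} - p_t \right) (\mu_t + \bar{\alpha}_t) \right]^2
&= \sum_{t=1}^{k} E \left( \frac{h_t}{h} - p_t \right)^2 (\mu_t + \bar{\alpha}_t)^2 \\
&\quad + \sum_{t \neq t'} E \left( \frac{h_t}{h} - p_t \right) \left( \frac{h_{t'}}{h} - p_{t'} \right) (\mu_t + \bar{\alpha}_t)(\mu_{t'} + \bar{\alpha}_{t'}) \\
&= \frac{N-h}{N-1} \cdot \frac{1}{h} \left[ \sum_{t=1}^{k} p_t (1 - p_t)(\mu_t + \bar{\alpha}_t)^2 - \sum_{t \neq t'} p_t\, p_{t'} (\mu_t + \bar{\alpha}_t)(\mu_{t'} + \bar{\alpha}_{t'}) \right] \\
&= \frac{N-h}{N-1} \cdot \frac{1}{h} \left[ \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t)^2 - \left\{ \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t) \right\}^2 \right] \\
&\doteq \frac{1}{h} \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t - \mu - \bar{\alpha})^2
\end{aligned}
\tag{62}
$$

for large $N$. Substituting from (61) and (62) in (60), we get 

$$ V(\bar{y}_{..}) = \frac{1}{h} \sum_{t=1}^{k} p_t ( S_{tx}^2 + S_\epsilon^2 ) + \frac{\bar{n}}{h} \sum_{t=1}^{k} p_t\, S_{t\alpha}^2 + \frac{1}{h} \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t - \mu - \bar{\alpha})^2 \tag{63} $$

The expression (63) can be simplified. Adding and subtracting 
from it $(1/h) \sum_t p_t\, S_{t\alpha}^2$, we may write 

$$
\begin{aligned}
V(\bar{y}_{..}) &= \frac{1}{h} \sum_{t=1}^{k} p_t ( S_{tx}^2 + S_\epsilon^2 + S_{t\alpha}^2 ) + \frac{1}{h} \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t - \mu - \bar{\alpha})^2 \\
&\quad + \frac{\bar{n}}{h} \sum_{t=1}^{k} p_t\, S_{t\alpha}^2 - \frac{1}{h} \sum_{t=1}^{k} p_t\, S_{t\alpha}^2
\end{aligned}
\tag{64}
$$

Further, we note that the variance of an observation can be 
expressed as 

$$ S_y^2 = \sum_{t=1}^{k} p_t ( S_{tx}^2 + S_\epsilon^2 + S_{t\alpha}^2 ) + \sum_{t=1}^{k} p_t (\mu_t + \bar{\alpha}_t - \mu - \bar{\alpha})^2 \tag{65} $$

whence, since $\bar{n}/h = 1/m$, 

$$ V(\bar{y}_{..}) = \frac{S_y^2}{h} + \left( \frac{1}{m} - \frac{1}{h} \right) \sum_{t=1}^{k} p_t\, S_{t\alpha}^2 \tag{66} $$

If we define $S_\alpha^2 = \sum_t p_t\, S_{t\alpha}^2$ we obtain for $V(\bar{y}_{..})$ the same 
expression as in (26), as we would in fact expect. 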


10a.7 An Alternative Expression for the Variance of the Sample 
Mean in Terms of the Covariance between Responses 
Obtained by the Same Enumerator 

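Let $S_{y\alpha}$ denote the covariance between observations recorded 
by the same enumerator. Then, clearly, for $i \neq i'$, 

$$
\begin{aligned}
S_{y\alpha} &= E \left[ (y_{ij} - \mu - \bar{\alpha})(y_{i'j} - \mu - \bar{\alpha}) \right] \\
&= E \left[ (x_i + \alpha_j + \epsilon_{ij} - \mu - \bar{\alpha})(x_{i'} + \alpha_j + \epsilon_{i'j} - \mu - \bar{\alpha}) \right] \\
&= E \big[ (x_i - \mu)(x_{i'} - \mu) + (\alpha_j - \bar{\alpha})^2 + \epsilon_{ij}\, \epsilon_{i'j} \\
&\qquad + (\alpha_j - \bar{\alpha}) \{ (x_i - \mu) + (x_{i'} - \mu) \} \\
&\qquad + (\alpha_j - \bar{\alpha})(\epsilon_{ij} + \epsilon_{i'j}) \\
&\qquad + \epsilon_{ij} (x_{i'} - \mu) + \epsilon_{i'j} (x_i - \mu) \big] \\
&= -\frac{S_x^2}{N} + S_\alpha^2 \left( 1 - \frac{1}{M} \right)
\end{aligned}
$$

since all other terms are zero, 

$$ S_{y\alpha} \doteq S_\alpha^2 \tag{67} $$

for large $N$ and $M$. Substituting in (26), we get 

$$ V(\bar{y}_{..}) = \frac{1}{h} \left[ S_y^2 + S_{y\alpha} \left( \frac{h}{m} - 1 \right) \right] \tag{68} $$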


The formula in this form was first derived by Hansen, Hurwitz, 
Marks and Mauldin (1951). It is instructive since it shows that 
if errors of observation are unrelated from unit to unit so that 
$S_{y\alpha} = 0$, the variance is given by the usual expression for samples 
from a large population. It follows also that when an enumerator 
is given only one unit to enumerate, in which case both $m = h$ 
and $S_{y\alpha} = 0$, the variance takes the minimum value, being given 
by the usual expression $S_y^2/h$. 
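
The agreement between the covariance form (68) and the component form of the variance can be verified numerically. The sketch below assumes large $N$ and $M$ and $p = 1$, so that $S_y^2 = S_x^2 + S_\alpha^2 + S_\epsilon^2$ and $S_{y\alpha} = S_\alpha^2$; the function names and the numerical values are hypothetical.

```python
def variance_component_form(s_x2, s_alpha2, s_eps2, h, m):
    """Variance of the sample mean written in terms of the three components."""
    return (s_x2 + s_eps2) / h + s_alpha2 / m


def variance_covariance_form(s_y2, s_y_alpha, h, m):
    """Variance of the sample mean in the form of equation (68)."""
    return (s_y2 + s_y_alpha * (h / m - 1)) / h


# Hypothetical components: h = 100 units shared among m = 5 enumerators.
s_x2, s_alpha2, s_eps2 = 60.0, 30.0, 10.0
s_y2 = s_x2 + s_alpha2 + s_eps2

by_components = variance_component_form(s_x2, s_alpha2, s_eps2, h=100, m=5)
by_covariance = variance_covariance_form(s_y2, s_alpha2, h=100, m=5)

# With one unit per enumerator (m = h) the variance falls to S_y^2 / h.
one_unit_each = variance_covariance_form(s_y2, s_alpha2, h=100, m=100)
```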


10a.8 Determination of the Optimum Number of Enumerators 


We have seen that the variance of the sample mean is decreased 
as the number of enumerators participating in the survey is 
increased. Practical considerations, however, place a limit on the 
number to which enumerators can be increased. Obviously, the 


termining this number js the 


When a sample of $h$ units is randomly and equally distributed 
among $m$ enumerators, then it is reasonable to suppose that the 
cost of the survey will be represented by 

$$ C = c_1 h + c_2 m + c_3 \sqrt{hm} \tag{69} $$

where 

$c_1$ is the cost of collecting information per unit, 
$c_2$ the cost of engaging an enumerator, 
and 
$c_3$ proportional to the cost of travel on unit distance. 
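
We shall further suppose that the money allotted to the survey 
is fixed at $C_0$. To calculate the optimum values of $h$ and $m$, we 
use the method of Lagrangian multipliers and form a function 
$\phi$ given by 

$$ \phi = \frac{S_y^2}{h} + S_\alpha^2 \left[ \frac{1}{m} - \frac{1}{h} \right] + \mu \left( c_1 h + c_2 m + c_3 \sqrt{hm} - C_0 \right) \tag{70} $$

where $\mu$ is a Lagrangian constant. 

Differentiating $\phi$ with respect to $h$, $m$ and $\mu$ and equating to 
zero, we obtain 

$$ \frac{\partial \phi}{\partial h} = -\frac{S_y^2 - S_\alpha^2}{h^2} + \mu \left[ c_1 + \frac{c_3}{2} \sqrt{\frac{m}{h}} \right] = 0 \tag{71} $$

$$ \frac{\partial \phi}{\partial m} = -\frac{S_\alpha^2}{m^2} + \mu \left[ c_2 + \frac{c_3}{2} \sqrt{\frac{h}{m}} \right] = 0 \tag{72} $$

and 

$$ \frac{\partial \phi}{\partial \mu} = c_1 h + c_2 m + c_3 \sqrt{hm} - C_0 = 0 \tag{73} $$

whence, multiplying both sides of (71) by $h$ and of (72) by $m$ 
and eliminating $\mu$, we obtain 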


Сз з .t3 pe а = 74. 
xx сді сік md (74) 
where 
с 52 
Pg om» Roos сез (75) 


An inspection of (74) shows that the equation has two real 
roots, one positive and one negative. An explicit expression for 
the roots is, however, difficult to obtain and the solution has 
therefore to be reached by the trial and error method. 

Two special cases of the cost function are of interest. When 
су is zero, the equation (74) reduces to a cubic 


Gag these «Әжі I. 76 
Ме ty Reus (76) 
but here again an explicit expression for $x$ is difficult to 
obtain. 
The other case of interest is when $c_3 = 0$. We get 


т = (77) 
ui cm 
Substituting for $m$ from (77) in (73) after putting $c_3 = 0$, we get 
ал + esf = С, (78) 
or 
= oe (79) 
к= су + св 
and 
— Hike 
DE атар (80) 


Equation (80) shows, what is of course obvious, that the larger 
the contribution of $S_\alpha^2$ relative to $S_y^2$, the larger should be the 
number of enumerators participating in the survey. On the 
other hand, it is likely that as the number of enumerators is 
increased, the difficulties of controlling the survey also increase, 


possibly resulting in a larger value of a. 


We shall not illustrate here the application of this result but 
shall refer to two more examples to gain some idea of the 
contribution of $S_\alpha^2$ relative to the total variation and the importance 
of controlling it in surveys. 
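
Although no worked application is given here, the constrained minimisation of this section can be sketched numerically as a stand-in for the trial and error method. The sketch rests on assumptions that should be stated plainly: the cost function is taken to be $C = c_1 h + c_2 m + c_3 \sqrt{hm}$, the variance is taken as $(S_x^2 + S_\epsilon^2)/h + S_\alpha^2/m$ (large $N$ and $M$, $p = 1$), $c_1$ must be positive, and $h$ is left unrounded. All function names and figures are hypothetical.

```python
import math


def sample_variance(h, m, s_rest2, s_alpha2):
    """Variance of the sample mean; s_rest2 stands for S_x^2 + S_eps^2."""
    return s_rest2 / h + s_alpha2 / m


def affordable_units(m, c1, c2, c3, budget):
    """Number of units h (not rounded) that exhausts the budget for a given m."""
    remaining = budget - c2 * m
    if remaining <= 0:
        return 0.0
    # The budget equation c1*h + c3*sqrt(m)*sqrt(h) = remaining is a
    # quadratic in sqrt(h); take its positive root.
    root_h = (-c3 * math.sqrt(m)
              + math.sqrt(c3 * c3 * m + 4 * c1 * remaining)) / (2 * c1)
    return root_h ** 2


def best_allocation(s_rest2, s_alpha2, c1, c2, c3, budget):
    """Try m = 1, 2, ... and return (m, h, variance) with the smallest variance."""
    best = None
    m = 1
    while True:
        h = affordable_units(m, c1, c2, c3, budget)
        if h < m:  # fewer units than enumerators: nothing further to try
            break
        v = sample_variance(h, m, s_rest2, s_alpha2)
        if best is None or v < best[2]:
            best = (m, h, v)
        m += 1
    return best


# Hypothetical inputs with no travel cost (c3 = 0).
m_opt, h_opt, v_opt = best_allocation(100.0, 25.0, c1=1.0, c2=4.0, c3=0.0, budget=1000.0)
```

With $c_3 = 0$, eliminating the multiplier from the two first-order conditions under the same assumptions gives $m/h = \sqrt{c_1 S_\alpha^2 / \{c_2 (S_x^2 + S_\epsilon^2)\}}$, which is 0.25 for the figures above and agrees with the allocation found by the search.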


Example 10.3 


This relates to a socio-economic Survey conducted by students 
of the International Training Centre on Censuses and Statistics 
for South-East Asia during December 1949. The Survey was 
carried out in three villages: Badli, Shamapur and Auchandi, 
situated near Delhi (India). The houses in each village were 
serially numbered and grouped into blocks of three. A certain 
number of these blocks was selected at random and within each 
block alternate households were enumerated. The sample for 
each village was divided into independent samples, one each to be 
enumerated by a different party of students. Thus, the work 
in the village Badli was divided among six parties of enumerators, 
that in Shamapur and Auchandi among four and two parties 
respectively. The questionnaire used for the survey was prepared 
by the students themselves and included a large number of items. 
The results given here relate only to two characters, viz., the 
proportion of illiterates and the proportion of persons economically 
independent in a household. Table 10.3 gives the estimated 
values for each of the two characters for one village, Badli. The 
table shows that there is more variability in the estimates given 
by different parties in the character "economic independence" 
than in the other character. The estimated values of $S_\alpha^2$ and 
of the total variation, i.e., $S_x^2 + S_\alpha^2 + S_\epsilon^2$, are also given in 
Table 10.3. The relative magnitude of $S_\alpha^2$ as compared to the 
total variation is seen to be larger in the case of "economic 
independence" than in the case of the other character as expected, 
but not significantly so, being based on a small number of cases. 


TABLE 10.3 
Socio-Economic Survey in Badli 

Party                            A    B    C    D    E    F 
% Illiterate                    88   75   90   87   62   95 
% Economically independent      67   46   17   50   46   31 

Values of $S_\alpha^2$ and $S_x^2 + S_\alpha^2 + S_\epsilon^2$ 

                              $S_\alpha^2$    $S_x^2 + S_\alpha^2 + S_\epsilon^2$ 
% Illiterate                       0          574 
% Economically independent       104          722 


Example 10.4 

This relates to the data on acreage collected in the course of 
a surprise check to which a reference has already been made 
in Section 1.8. Acreage under crops in India is compiled by the 
village accountants by noting the names of the crops field by 
field in the course of their administrative duties. As all the fields 
are surveyed and mapped and the area of each field (survey 
number) is, therefore, accurately known, the total area under any 
crop is obtained by simply adding the area of the fields growing 


that crop. 
This method, though sound in principle, is not always free 
from errors in practice. In fact, there is a criticism that the 
village accountants do not exercise sufficient care in ascertaining 
the names of crops grown in the fields in their respective villages. 
A surprise check was, therefore, organized in randomly selected 
villages of Lucknow District in Uttar Pradesh (India) to examine 
the extent of the inaccuracy of the records maintained by the 
village accountants. The check was carried out by the Statistical 
Staff of the Department of Agriculture, Uttar Pradesh, and of the 
Indian Council of Agricultural Research, New Delhi. 


Altogether 61 villages were selected for the purpose of this 
check. In each village 8 fields were selected at random. In 
each field the statistical investigator was asked to record the name 
of the crop. In case more than one crop was grown, he was 
asked to give the names of the crops together with the proportion 
of the area under each. The village accountants’ records for the 
same fields were taken from the register maintained by them. 
The check was carried out at harvest time after the village 
accountants had completed their inspection and made entries in 
the register. Table 10.4 summarises the results. 


TABLE 10.4 
Comparison of Crop-Acreages 

                  Gross Area (Sq.ft.) as recorded by 
Name of Crop   Village Accountants   Statistical Staff   Difference (2) - (3)   Percentage of (4) over (3) 
(1)            (2)                   (3)                 (4)                    (5) 
Wheat          3313224               3227149             + 86075                + 2.7 
Gram           2391840               2415085             - 23245                - 1.0 
Barley         1873291               2029875             - 156584               - 7.7 
Arhar          1888117               2103171             - 215054               - 10.2 

It is seen that the discrepancy varies from + 2.7% in wheat 
to - 10.2% in arhar. Although the discrepancies do not appear 
to be serious or in the same direction, they point to the need for 
strengthening the supervision over the work of the field staff and 
the conduct of similar checks in other parts of the country. 


This conclusion is confirmed by treating the data by the methods 
developed in this chapter. We have 


k — 61, ту =Oorl, n-—p-—2,h,—8,m—2 


On analysing the data, it was found that $S_\alpha^2$ did not exceed 5% 
of the total variation in the case of any crop. 


10a.9 Limitations of the Method of Replicated Samples in Surveys 


We have seen that when a survey is arranged in the form of 
replicated samples, each one to be enumerated by a different 
enumerator, we can estimate $S_\alpha^2$ though we cannot separate $\mu$ from $\bar{\alpha}$. 
We also saw that $S_\alpha^2$ may be considerable, depending upon the 
character observed and the conditions of the survey. In general, it may 
be said that for characters whose value is influenced by the judgment 
of the enumerator, like an eye-estimate of the yield of a crop, or for 
those for which it is difficult to obtain a precise answer, like farm 
expenditure, $S_\alpha^2$ may be considerable. $S_\alpha^2$ measures, as it were, 
the influence of enumerators on response and is likely to be 
particularly large in surveys in underdeveloped countries. 

A question naturally arises whether a sample should be 
randomly divided among enumerators participating in the survey 
as a matter of course in order to make possible the measurement 
of $S_\alpha^2$ and test its significance relative to the total variation. 

Obviously, the answer must depend upon the considerations of 
cost and of the precision with which it is possible to test the 
significance of $S_\alpha^2$. For any single stratum the test of significance 
of $S_\alpha^2$ is provided by $s_b^2/s_w^2$, but it does not have a good 
discriminating power as the size of the sample allotted to any single 
stratum is normally small. In consequence, the test would most 
of the time fail to reveal the significant existence of $S_\alpha^2$ even 
when the differences between the enumerators' estimates are large. 


One method of improving the discriminating power of the test 
is to use what are termed linked samples. In this arrangement, 
linked pairs (or groups) of sampling units are selected and one 
sampling unit of each pair is allotted to one enumerator and 
the other to the second enumerator (Mahalanobis, 1944). This 
helps to reduce the variance of the difference between the 
estimates. On the other hand, it is easy to see from Section 7.6 
that the variance of the pooled estimate is increased by 
$S_y^2 (m - 1) \rho / n$, where $\rho$ is the intra-class correlation between the 
$m$ members of linked samples, $m$ is the number of enumerators 
and $n$ is the total number of units in the sample. There is thus 
a loss in efficiency of the combined sample estimate in relation 
to the procedure where linked samples are not used. Any 
decision to use linked samples has therefore to be made after 
ascertaining the increase in expenditure on the survey for 
attaining the desired precision. 


Another method of overcoming the difficulty is to pool the 
results over all the strata and use the pooled ratio $s_b^2/s_w^2$ to test 
for differential bias among enumerators. This test is more sensitive 
than the one for a single stratum but is of limited use as it does not 
help to locate the disagreement among enumerators, since for the 
location of disagreement the results of individual strata must be 
examined, but as stated in the previous paragraph the test of 
significance for individual strata will not reveal the differential 
bias most of the time. 


» the cost of the Survey will be increased 
roughly in proportion to (ут — 1). 
that the travelling cost per enume 
proportional to Ут, where р’ is th 
enumerated by each enumerator distr 
area. If travel cost forms a conside 
a survey, the loss in efficiency resultin 
appreciably large. 


These are the limitations one should bear in mind in using 
the method of replicated samples in surveys. Further, in 
interpreting the results of replicated samples, there is a danger of 
guiding oneself into the belief that the fieldwork is under control, 
when in actual fact it may well be otherwise. An example will 
help to illustrate the point. 

This example is taken from a jute survey in Bengal (Mahalanobis, 
1944). The object of the survey was to estimate the area under 
jute. The survey was organized in the form of two samples A 
and B. In each of the 379 strata into which Bengal was divided 
for purposes of the survey, the difference between the estimates 
A and B was tabulated and tested for significance. Table 10.5 
reproduced below gives the distribution of the values of $t$ and 
shows that in 109 out of 379 cases, the value of $t$ was significant, 
although the expected number of significant cases at this level 
is only 19. 

TABLE 10.5 
Comparison of Samples A and B: Student's t for Strata 

Range of             Number of Cases 
probability      Observed   Expected   Obs. - Exp.   $\chi^2$ 
Less than 0.05      109       18.95      + 90.05      427.92 
0.05-0.10            20       18.95      +  1.05        0.06 
0.10-0.90           235      303.20      - 68.20       15.34 
0.90-0.95            12       18.95      -  6.95        2.55 
0.95-1.00             3       18.95      - 15.95       13.42 
Total               379      379.00                   459.29 
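
The expected frequencies and the $\chi^2$ column of Table 10.5 can be reproduced from the observed counts alone. The sketch below assumes that, under agreement between the two samples, the probability levels are uniformly distributed, so that each class has an expected share equal to its width; the variable names are introduced here for illustration.

```python
# Observed numbers of strata in each probability class of Table 10.5
observed = [109, 20, 235, 12, 3]
# Width of each class: <0.05, 0.05-0.10, 0.10-0.90, 0.90-0.95, 0.95-1.00
class_widths = [0.05, 0.05, 0.80, 0.05, 0.05]

total_strata = sum(observed)
expected = [total_strata * width for width in class_widths]

# Contribution of each class to chi-square: (observed - expected)^2 / expected
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

for o, e, c in zip(observed, expected, contributions):
    print(f"{o:5d} {e:8.2f} {o - e:+8.2f} {c:8.2f}")
print(f"chi-square = {chi_square:.2f}")
```

Nearly all of the total comes from the first class, which is the excess of significant differences discussed in the text.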


In order to find an explanation for this large difference, a 
scrutiny of the field records was made by the author of the 
survey; it showed that in 84 out of the 109 strata, the 
discrepancy could be ascribed to the influence of real physical differences, 
such as weather conditions during the periods in which the work 
of the A and B samples was carried out. Omitting these 84 cases, 
there are 295 left. The distribution of the remaining 295 values 
of $t$ was found to be in satisfactory agreement with the expected 
distribution of $t$. It was concluded that "the object of using 
the replicated sampling method was entirely successful". 


This interpretation, however, raises a logical difficulty. Once 
discrepant work is suspected, it would appear only proper to 
scrutinize the work in all the strata and not confine the scrutiny 
to only those having significant values of $t$. For, in a stratum 
where $t$ is non-significant, one can also expect discrepant work 
since the non-significance can be due to the opposite effects of 
the discrepancy in work and the real physical differences between 
the two samples A and B. When the sample size is small, as it 
will usually be in each stratum, this method may lead one to 
look for trouble where it does not exist and to overlook it where 
it does, since real large differences may be declared non-significant 
and chance differences significant. There also arise practical 
difficulties in going back to the sampling units for the scrutiny 
required after the survey is over. The whole procedure of accepting 
the verdict of agreement where $t$ is non-significant and explaining 
the difference in terms of physical differences where $t$ is significant, 
is logically untenable. 


Even apart from considerations of cost and interpretation, a 
random allotment of the sampling units among enumerators 
cannot by itself be an effective tool for the control of fieldwork, 
and the need of controlling it in other ways is obvious. This 
need it would appear is best met by providing adequate and 
effective supervision over fieldwork. We shall conclude this dis- 
cussion by examining the roles of supervision and replication at 
the primary level. The two differ in several respects : 


(i) Supervision is carried out by the superior staff, better paid, 
qualified and experienced as compared to the enumerators 
at the primary level employed in replicated samples. 


(ii) It is carried out on a part of the work performed at the 


primary level whereas replicated samples require at least 
two independent samples. 


(iii) Supervision is not confined only to enumerating for the 
second time units once observed at the primary level. 
It has a wider objective in view, namely that of correcting 
and improving the fieldwork on the spot, whereas replicated 
samples will usually suggest the need for improvement 
when the survey is over. 


(iv) A supervisor need not be present throughout the operations 
connected with the enumeration of a selected unit, 
whereas an enumerator under sample survey must enu- 
merate completely every unit assigned to him. 


(v) Units selected for supervision may or may not be selected 
by the principle of random sampling, whereas in repli- 
cated samples they will necessarily be so selected. When 
it is possible to arrange supervision on a probability 
basis and the work done by the supervisors is considered 
a sub-sample of the work done at the primary level, 
supervision may be considered a very special form of 
replicated samples subject to the differences mentioned 
above. This way supervision can be utilised to improve 
the estimates obtained from the work done at the 
primary level. 


(vi) Replicated samples will not reveal minor defects in an 
investigator and will certainly not reveal faults which are 
common to all the investigators, whereas this is possible 
with supervisory checks. 


(vii) Replicated samples alone can estimate observational errors 
whereas supervision will not, unless conducted as visual- 
ised in (v). 

It would be seen that supervision can provide a better control 
over fieldwork in a variety of ways which is not possible in the 
case of replicated samples. Replicated samples are no alternative 
to supervisory check, though the latter can be.  Replicated 
samples have a place either when the object of the survey is to 
compare different methods or different classes of investigators, or 
at the pilot stage of a large-scale survey for testing questionnaires 
and procedures, but would hardly appear worth while for adoption 
as a regular feature of surveys. 

B. INCOMPLETE SAMPLES 

10b.1 The Problem 


It is common experience that some of the units selected in the 
sample do not respond, at least at the first attempt, and indeed 
may not respond even after repeated attempts. Thus the selected 
farmers or families may not be found at home at the first attempt 
and some may refuse to co-operate with the interviewer even if 
contacted at the second attempt. Persuasion and further attempts 
are therefore invariably required for achieving completeness. 
This, however, increases the cost of the survey. On the other hand, 
estimates based on the incomplete sample may be biased. This 
extent of incompleteness, called non-response, is sometimes so 
large as to completely vitiate the estimate. The problem is 
particularly important in interview surveys. In the following 
section, we shall give the solution of the problem as first put 
forward by Hansen and Hurwitz (1946), which consists in drawing 
a sub-sample of non-respondents and enumerating it completely 
through later attempts, the size of the total sample and that of 
the sub-sample in the non-response group being so determined 
as to give an unbiased estimate of the population value with the 
desired precision at minimum cost. 


10b.2 The Solution of the Problem of Incomplete Samples 


We shall suppose that the population can be divided into two 
classes, those who will respond at the first attempt and those 
who will not. For convenience we shall call the two classes as 
the response and non-response classes. If пу units in the sample 
respond and л, do not, then we may regard п, a random sample 
of the response class and п» a random sample of the non-response 
class. Let h, denote the size of the sub-sample from л» to be 
enumerated at the second attempt, such that 

n = fhe (81) 


Further, let №, and N, denote the sizes of the response and the 
non-response classes in the population. Clearly, N, and М, cannot 
be known and can only be estimated from the sample. We have 


N 
Est. N, = a 
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апа 
nN 


Est. М» = 
n 


The cost of the survey will be made up of three parts as 
follows: 


С = сп + су + су (82) 


where $c_0$ represents the cost of locating a sample unit at the
first attempt, $c_1$ the cost of enumerating and processing information
per unit in the response class, and $c_2$ the cost of enumerating
and processing information per unit in the non-response class.
This cost will obviously vary from sample to sample. We shall 
therefore consider the average cost of the survey. Substituting for 
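$n_1$ and $n_2$ their expected values, $nN_1/N$ and $nN_2/N$, and
writing $h_2 = n_2/f$, we get

$$\bar{C} = \frac{n}{N}\left(N c_0 + N_1 c_1 + \frac{N_2}{f}\, c_2\right) \tag{83}$$

Clearly, the estimate of the population mean is given by

$$\bar{y}_w = \frac{n_1 \bar{y}_{n_1} + n_2 \bar{y}_{h_2}}{n} \tag{84}$$

where $\bar{y}_{n_1}$ is the mean of the $n_1$ units responding at the
first attempt and $\bar{y}_{h_2}$ the mean of the $h_2$ units in the
sub-sample.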
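The estimate (84) can be sketched in a few lines of Python. This is only an illustration: the function name and the figures are hypothetical, and it is assumed that the first-attempt responses and the sub-sample values are available as plain lists.

```python
def hansen_hurwitz_mean(y_resp, y_sub, n2):
    """Estimate (84): (n1 * mean of respondents + n2 * mean of sub-sample) / n.

    y_resp : values for the n1 units that responded at the first attempt
    y_sub  : values for the h2 sub-sampled non-respondents
    n2     : number of non-respondents in the original sample (n2 = f * h2)
    """
    n1 = len(y_resp)
    n = n1 + n2
    mean_resp = sum(y_resp) / n1
    mean_sub = sum(y_sub) / len(y_sub)
    # The sub-sample mean stands in for all n2 non-respondents.
    return (n1 * mean_resp + n2 * mean_sub) / n


# Hypothetical figures: 6 respondents, 4 non-respondents,
# of whom 2 are enumerated at the second attempt (f = 2).
estimate = hansen_hurwitz_mean([10, 12, 11, 9, 13, 11], [6, 8], n2=4)
print(estimate)
```

Note that the sub-sample mean is weighted by $n_2$, not by $h_2$; weighting by $h_2$ would under-represent the non-response class.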


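It is easily shown that this gives an unbiased estimate of the
population mean, for

$$\begin{aligned}
E(n_1 \bar{y}_{n_1} \mid n) &= E\{E(n_1 \bar{y}_{n_1} \mid n_1, n)\} \\
&= E(n_1 \bar{y}_{N_1} \mid n) \\
&= \frac{n N_1 \bar{y}_{N_1}}{N}
\end{aligned}$$

and

$$\begin{aligned}
E(n_2 \bar{y}_{h_2} \mid n) &= E\big[E\{E(n_2 \bar{y}_{h_2} \mid y_1, \ldots, y_{n_2};\, n_2, n)\}\big] \\
&= E\{E(n_2 \bar{y}_{n_2} \mid n_2, n)\} \\
&= E(n_2 \bar{y}_{N_2} \mid n) \\
&= \frac{n N_2 \bar{y}_{N_2}}{N}
\end{aligned}$$

whence

$$E(\bar{y}_w) = \frac{N_1 \bar{y}_{N_1} + N_2 \bar{y}_{N_2}}{N} = \bar{y}_N \tag{85}$$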


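To obtain the variance of $\bar{y}_w$, we have

$$V(\bar{y}_w) = E\left\{\frac{n_1 \bar{y}_{n_1} + n_2 \bar{y}_{h_2}}{n} - \bar{y}_N\right\}^2$$

The right-hand side may be written as

$$\begin{aligned}
&E\left\{\frac{n_1 \bar{y}_{n_1} + n_2 \bar{y}_{n_2}}{n} - \bar{y}_N + \frac{n_2}{n}(\bar{y}_{h_2} - \bar{y}_{n_2})\right\}^2 \\
&\quad = E\left\{(\bar{y}_n - \bar{y}_N) + \frac{n_2}{n}(\bar{y}_{h_2} - \bar{y}_{n_2})\right\}^2 \\
&\quad = E(\bar{y}_n - \bar{y}_N)^2 + E\left\{\frac{n_2^2}{n^2}(\bar{y}_{h_2} - \bar{y}_{n_2})^2\right\} \\
&\qquad + 2E\left\{\frac{n_2}{n}(\bar{y}_n - \bar{y}_N)(\bar{y}_{h_2} - \bar{y}_{n_2})\right\}
\end{aligned} \tag{86}$$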
We know already that

$$E(\bar{y}_n - \bar{y}_N)^2 = \left(\frac{1}{n} - \frac{1}{N}\right) S^2 \tag{87}$$

where $S^2$ is the mean square for the whole population.


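To evaluate the second term in (86), we first have

$$\begin{aligned}
E(\bar{y}_{h_2} - \bar{y}_{n_2})^2 &= E\big[E\{(\bar{y}_{h_2} - \bar{y}_{n_2})^2 \mid h_2;\, y_1, \ldots, y_{n_2}\}\big] \\
&= E\left\{\left(\frac{1}{h_2} - \frac{1}{n_2}\right) s_2^2\right\}
\end{aligned}$$

where

$$s_2^2 = \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (y_i - \bar{y}_{n_2})^2$$

so that

$$E\{(\bar{y}_{h_2} - \bar{y}_{n_2})^2 \mid n_2\} = \left(\frac{1}{h_2} - \frac{1}{n_2}\right) S_2^2 \tag{88}$$

where

$$S_2^2 = \frac{1}{N_2 - 1} \sum_{i=1}^{N_2} (y_i - \bar{y}_{N_2})^2$$

Hence

$$\begin{aligned}
E\left\{\frac{n_2^2}{n^2}(\bar{y}_{h_2} - \bar{y}_{n_2})^2\right\} &= E\left\{\frac{n_2^2}{n^2}\left(\frac{1}{h_2} - \frac{1}{n_2}\right) S_2^2\right\} \\
&= \frac{f-1}{n^2}\, E(n_2)\, S_2^2 \\
&= \frac{f-1}{n} \cdot \frac{N_2}{N}\, S_2^2
\end{aligned} \tag{89}$$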


The value of the third term in (86) is clearly zero since, for a
given first sample, the expected value of $\bar{y}_{h_2}$ is
$\bar{y}_{n_2}$. Hence, from (86), (87) and (89), we get

$$V(\bar{y}_w) = \left(\frac{1}{n} - \frac{1}{N}\right) S^2 + \frac{f-1}{n} \cdot \frac{N_2}{N}\, S_2^2 \tag{90}$$

If $f = 1$, the second term of (90) will vanish and we shall be
left with the variance of the mean of a simple random sample of
$n$, as we would expect. The second term represents the increase
in the variance arising from sub-sampling $h_2$ out of $n_2$ units.
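The two terms of (90) can be evaluated separately, which makes the price of sub-sampling visible. The following is a minimal sketch; the function name and all numerical inputs are hypothetical and chosen only for illustration.

```python
def hh_variance(n, f, N, N2, S_sq, S2_sq):
    """Variance (90) of the estimated mean.

    n, f   : size of the first sample and sub-sampling ratio f = n2 / h2
    N, N2  : population size and size of the non-response class
    S_sq   : mean square S^2 for the whole population
    S2_sq  : mean square S_2^2 within the non-response class
    """
    srs_term = (1.0 / n - 1.0 / N) * S_sq                 # simple random sample of n
    subsampling_term = (f - 1.0) / n * (N2 / N) * S2_sq   # increase due to sub-sampling
    return srs_term + subsampling_term


# Hypothetical inputs: complete follow-up (f = 1) against one in two (f = 2).
v_full = hh_variance(n=100, f=1, N=10_000, N2=5_000, S_sq=25.0, S2_sq=20.0)
v_half = hh_variance(n=100, f=2, N=10_000, N2=5_000, S_sq=25.0, S2_sq=20.0)
print(v_full, v_half)
```

With $f = 1$ the function returns the ordinary simple random sampling variance, in agreement with the remark following (90).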


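We shall now proceed to determine the optimum values of $n$
and $f$. Let

$$\begin{aligned}
\phi &= \bar{C} + \mu (V - V_0) \\
&= \frac{n}{N}\left(N c_0 + N_1 c_1 + \frac{N_2}{f}\, c_2\right) \\
&\quad + \mu \left\{\frac{N-n}{Nn}\, S^2 + \frac{f-1}{n} \cdot \frac{N_2}{N}\, S_2^2 - V_0\right\}
\end{aligned} \tag{91}$$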
where $V_0$ is the value of the variance with which it is desired
to estimate the mean and $\mu$ is a Lagrangian multiplier.
Differentiating with respect to $n$ and $f$, and equating to zero,
we obtain

$$\frac{1}{N}\left(N c_0 + N_1 c_1 + \frac{N_2}{f}\, c_2\right) = \frac{\mu}{n^2}\left\{S^2 + \frac{N_2}{N}(f-1)\, S_2^2\right\} \tag{92}$$

and

$$n^2 c_2 = \mu f^2 S_2^2 \tag{93}$$


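On substituting for $1/n^2$ from (93) in (92), we obtain

$$\frac{1}{N}\left\{N c_0 + N_1 c_1 + \frac{N_2}{f}\, c_2\right\} = \frac{c_2}{f^2 S_2^2}\left\{S^2 + \frac{N_2}{N}(f-1)\, S_2^2\right\}$$

which reduces to

$$\frac{N c_0 + N_1 c_1}{N}\, f^2 S_2^2 = c_2 \left(S^2 - \frac{N_2}{N}\, S_2^2\right)$$

Hence

$$f = \sqrt{\frac{c_2 \left(S^2 - \dfrac{N_2}{N}\, S_2^2\right)}{S_2^2 \left(c_0 + \dfrac{N_1}{N}\, c_1\right)}} \tag{94}$$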
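Equation (94) depends only on the unit costs, the non-response proportion and the two mean squares, so it is easily evaluated. The sketch below is illustrative only: the function name is hypothetical, `W2` is written for $N_2/N$, and the inputs are invented figures.

```python
import math


def optimum_f(c0, c1, c2, W2, S_sq, S2_sq):
    """Equation (94); W2 = N2 / N, so that N1 / N = 1 - W2."""
    W1 = 1.0 - W2
    return math.sqrt(c2 * (S_sq - W2 * S2_sq) / (S2_sq * (c0 + W1 * c1)))


# Hypothetical inputs: equal mean squares, half the population not responding,
# and a second-attempt enumeration costing 16 units against 1 and 2.
f_opt = optimum_f(c0=1.0, c1=2.0, c2=16.0, W2=0.5, S_sq=1.0, S2_sq=1.0)
print(f_opt)
```

A value of $f$ below 1 is not attainable, since $h_2$ cannot exceed $n_2$; the sketch leaves that check to the caller, who would then take $f = 1$.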
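From (92), we obtain

$$n = \sqrt{\mu}\, \sqrt{\frac{S^2 + \dfrac{N_2}{N}(f-1)\, S_2^2}{\dfrac{1}{N}\left(N c_0 + N_1 c_1 + \dfrac{N_2}{f}\, c_2\right)}} \tag{95}$$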


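Now to find $\mu$ we note that

$$V_0 = \frac{N-n}{Nn}\, S^2 + \frac{f-1}{n} \cdot \frac{N_2}{N}\, S_2^2$$

or

$$V_0 + \frac{S^2}{N} = \frac{1}{n}\left\{S^2 + \frac{N_2}{N}(f-1)\, S_2^2\right\} \tag{96}$$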
On substituting for $n^2$ from (92), we therefore have

$$\mu = \frac{\left(N c_0 + N_1 c_1 + \dfrac{N_2}{f}\, c_2\right)\left\{S^2 + \dfrac{N_2}{N}(f-1)\, S_2^2\right\}}{N \left(V_0 + \dfrac{S^2}{N}\right)^2}$$

whence, on substituting the result in (95), we finally have

$$n = \frac{S^2 + \dfrac{N_2}{N}(f-1)\, S_2^2}{V_0 + \dfrac{S^2}{N}} \tag{97}$$
Equations (94) and (97) thus provide the values of $n$ and $f$ required
to estimate the population mean with the desired standard error
at the minimum cost.
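As a check on (97), the value of $n$ it gives may be substituted back into (90), which should return the desired variance. The sketch below does this; the function name and the figures are hypothetical.

```python
def optimum_n(f, V0, N, N2, S_sq, S2_sq):
    """Equation (97): sample size giving variance V0 for a chosen f."""
    numerator = S_sq + (N2 / N) * (f - 1.0) * S2_sq
    return numerator / (V0 + S_sq / N)


# Hypothetical inputs, for illustration only.
N, N2, S_sq, S2_sq = 10_000, 5_000, 25.0, 20.0
f, V0 = 2.0, 0.25

n = optimum_n(f, V0, N, N2, S_sq, S2_sq)

# Substitute n back into (90); the result should equal V0.
achieved = (1.0 / n - 1.0 / N) * S_sq + (f - 1.0) / n * (N2 / N) * S2_sq
print(round(n, 1), round(achieved, 6))
```

In practice $n$ would be rounded up to the next whole number, which makes the variance achieved slightly smaller than $V_0$.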

An example will serve to illustrate the method. Suppose the
response rate is 50% and $S_2^2$ for the non-response group is 4/5
of that in the whole population. In other words,

$$\frac{N_1}{N} = \frac{N_2}{N} = \frac{1}{2}$$

and

$$S_2^2 = \frac{4}{5}\, S^2$$


Ignoring the finite multiplier, we then obtain from equation (90),
the variance of the estimated mean as

$$V(\bar{y}_w) = \frac{S^2 (3 + 2f)}{5n} \tag{98}$$
To work out the cost of the survey let us assume that it costs
one rupee to contact a unit, four rupees to enumerate and
process information on a unit in the response group and eight
rupees to enumerate and process information on a unit in the
non-response group. Substituting these values in (83), the total
cost of the survey is, therefore, given by

$$C = n\left(3 + \frac{4}{f}\right) \tag{99}$$
Now let us suppose that we wish to estimate the mean of the
population with a desired variance $V_0$ equal to, say, $S^2/100$.
Substituting in (98), we have

$$\frac{3 + 2f}{5n} = \frac{1}{100} \tag{100}$$


Table 10.6 sets out, for different values of $f$, the values of $n$
obtained from the above equation and those of the cost of the
survey obtained on substituting the values of $n$ and $f$ in (99).
The expected value of $h_2$ is also given in the table. It will be seen
from the table that the cost is the same for (i) $f = 1$, $n = 100$,
and (ii) $f = 2$, $n = 140$. For values larger than $f = 2$ and
$n = 140$, the cost is higher. The most economical sample would
therefore seem to lie between 100 and 140 with $f$ between 1 and 2.


TABLE 10.6


Values of n and f which Provide Estimates of the
Mean of the Same Precision


  f       n     C (rupees)    E(h_2)
  1     100        700          50
  2     140        700          35
  3     180        780          30
  4     220        880          27
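The entries of Table 10.6 follow directly from (99) and (100) and can be recomputed as below. This is a sketch only; the function name is hypothetical, and the expected value of $h_2$ is taken as $n/(2f)$, which follows from $N_2/N = 1/2$ and $h_2 = n_2/f$.

```python
def table_rows(f_values):
    """Recompute n, cost and E(h2) for each f under the example's assumptions."""
    rows = []
    for f in f_values:
        n = 20 * (3 + 2 * f)        # from (100): (3 + 2f) / (5n) = 1/100
        cost = n * (3 + 4 / f)      # from (99), in rupees
        expected_h2 = n / (2 * f)   # E(h2) = n * (N2/N) / f with N2/N = 1/2
        rows.append((f, n, cost, expected_h2))
    return rows


rows = table_rows([1, 2, 3, 4])
for f, n, cost, h2 in rows:
    print(f"{f:>2} {n:>4} {cost:>6.0f} {h2:>6.1f}")
```

The recomputed values of $n$ and of the cost agree with the table; the last value of the expected $h_2$ comes out as 27.5, which the table shows as 27.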


The optimum values of $n$ and $f$ can also be directly obtained
from equations (94) and (97). Thus $f$ is found to be 1.4 and
$n = 116$.
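The optimum may also be located numerically, by evaluating the cost (99) along the curve (100) of equal precision and picking the cheapest point. The following is a rough sketch under the assumptions of the example; the function names and the grid of trial values of $f$ are arbitrary choices made for illustration.

```python
def n_required(f):
    """Sample size from (100) giving variance S^2 / 100."""
    return 20 * (3 + 2 * f)


def cost(f):
    """Cost (99) in rupees, with n chosen to satisfy (100)."""
    return n_required(f) * (3 + 4 / f)


# Trial values of f from 1.000 to 4.000 in steps of 0.001.
grid = [1 + k / 1000 for k in range(0, 3001)]
best_f = min(grid, key=cost)

print(round(best_f, 3), round(n_required(best_f), 1), round(cost(best_f), 1))
```

The search settles close to $f = \sqrt{2}$, the value given by (94) for these figures, at a cost somewhat below the 700 rupees required at $f = 1$ or $f = 2$. With $f$ rounded to 1.4, equation (100) gives $n = 116$ as stated.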